In [1]:
import re
import time
from typing import Iterable, List, Dict, Optional

import pandas as pd
pd.set_option('display.max_columns', 150)

import requests
from lxml import html, etree

import math

# Data Collection for the Irish Schools Collection

This notebook implements a complete, reproducible pipeline for constructing a structured research dataset from the *Irish Schools Collection* hosted on Dúchas (<https://www.duchas.ie>). The objective is to transform the raw collection—distributed across hundreds of school-level landing pages and thousands of semi-structured HTML/XML documents—into a clean, analyzable dataset that can support downstream work in cultural analytics, historical research, linguistic variation, geospatial modeling, and computational folklore studies.

## Background: The Irish Schools Collection

Between 1937 and 1939, more than 100,000 Irish schoolchildren participated in a remarkable grassroots folklore project coordinated by the Irish Folklore Commission. Pupils interviewed elders in their families and communities, recording stories, customs, superstitions, local histories, material culture, folk medicine, and everyday beliefs. The resulting manuscripts—over half a million pages—constitute one of the largest folk-ethnographic collections in Europe.

The digitized archive hosted by Dúchas provides access to scanned pages, HTML transcriptions, and TEI/XML representations. Journalistic and scholarly analyses have emphasized the archive’s breadth and strangeness—including the *Irish Times* profile of “Ireland’s darkest, oddest, and weirdest secrets” (<https://www.irishtimes.com/life-and-style/people/ireland-s-darkest-oddest-and-weirdest-secrets-uncovered-1.3418059>) and the National Folklore Collection’s historical introduction to the Schools Collection (<https://www.duchas.ie/download/17.01.26-irish-folklore-and-tradition.pdf>). Researchers have also drawn on this material for domain-specific studies, such as network analyses of folk-medicinal knowledge (e.g., *Frontiers in Pharmacology*, <https://www.frontiersin.org/articles/10.3389/fphar.2020.584595/full>).

For computational work, the archive is both an opportunity and a technical challenge: its metadata are spread across multiple HTML views, XML endpoints, and external services such as Logainm (the official Irish placenames database). Moreover, its structure is hierarchical:

- **School → Page → Item**, with each level containing partially overlapping metadata.

This notebook constructs the foundational dataset required to navigate this multi-level structure.


___________
## Overview of the pipeline

The workflow implemented here has five stages, corresponding to the major sections of the notebook:

1. Construct a school-level dataset by crawling the schools index pages and extracting basic metadata for each school.
2. Enrich each school with additional metadata from the Dúchas XML endpoints and associated Logainm entries.
3. Use this enriched school dataset as a basis for collecting item-level titles and item URLs for stories, essays, and other materials.
4. For each item, retrieve and parse both the HTML and XML representations to extract text and structured features.
5. Estimate request volume and runtime, and benchmark subsets of the pipeline to ensure that the full crawl is both feasible and polite to the host servers.

The rest of the notebook is organized as follows:


### Section 1: Schools index crawl (`df_schools`)

In Section 1, we crawl all publicly available Schools Collection index pages:

```
https://www.duchas.ie/en/cbes/schools?Page=<page>&PerPage=20
```

From each page we extract:

* a unique school identifier (SchoolID)
* the school URL on Dúchas
* the school name
* any visible CBES volume identifier (for example, “CBES 0038C”)
* the reported percentage of material transcribed
* the raw card text for reference

We also implement a small discovery routine to find the actual range of valid index pages (rather than hard-coding an upper bound), and we add basic deduplication. The output of this stage is a dataframe, `df_schools`, with one row per school.
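
One way such a discovery routine could work is sketched below. It assumes that valid index pages form a contiguous run starting at the first page. The function name `discover_last_page` and the `page_has_schools` callback are hypothetical: in practice the callback would fetch one index page and report whether it contains any school cards. The search doubles its stride until it reaches an empty page, then bisects, so it needs only a handful of requests.

```python
from typing import Callable, Optional


def discover_last_page(
    page_has_schools: Callable[[int], bool],
    first_page: int = 1,
    max_page: int = 10_000,
) -> Optional[int]:
    """
    Return the largest page number for which page_has_schools(page) is True,
    or None if even the first page is empty.

    ASSUMPTION: valid pages are contiguous (first_page..N), so a single
    empty page marks the end of the index.
    """
    if not page_has_schools(first_page):
        return None

    # Phase 1: grow the stride until we step onto an empty page (or the cap).
    lo, step = first_page, 1
    hi = lo + step
    while hi <= max_page and page_has_schools(hi):
        lo = hi
        step *= 2
        hi = lo + step
    hi = min(hi, max_page + 1)

    # Phase 2: bisect. Invariant: page `lo` is valid, page `hi` is not
    # (or lies beyond max_page).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if page_has_schools(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Offline illustration with a fake probe instead of real HTTP requests.
last_page = discover_last_page(lambda p: 1 <= p <= 225)
```

The probe is passed in as a callable so that the search logic can be checked without touching the network; a real probe should reuse the same session and throttle as the rest of the crawl.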


### Section 2: School XML and Logainm enrichment (`df_schools_enriched` / `df_schools_full`)

In Section 2, we enrich each school using its XML endpoint:

```
https://www.duchas.ie/xml/cbes/<SchoolID>
```

For each school we:

* fetch and parse the XML document
* extract teacher names (Irish and English, when available)
* extract the school roll number
* detect any Logainm URL associated with the school
* fetch the corresponding Logainm page and, when present, extract a WKT geometry from `data-wkt` attributes

This stage uses a reusable HTTP session with retry logic, backoff for transient errors and 429 rate limits, and a small throttle for politeness. The result is an enriched school-level dataframe, `df_schools_enriched`, which is then merged back into the original `df_schools` to produce `df_schools_full`. This enriched table is the backbone for any spatial analyses or school-level comparisons.


### Section 3: Item index extraction (`df_items_index`)

In Section 3, we move from schools to items. Each school’s landing page links to the individual items (stories, essays, reports, etc.) recorded there. These links typically encode school, page, and item identifiers in the URL.

Here we:

* visit each school’s CBES page
* parse all relevant item links
* extract SchoolID, PageID, ItemID, the full item URL, and the displayed item title

These rows are combined into a single dataframe, `df_items_index`, which indexes all items in the collection that are reachable from the schools pages. This representation (`SchoolID`, `PageID`, `ItemID`, `ItemURL`, `ItemTitle`) provides the basic item-level graph on top of the school-level backbone.
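
As a rough illustration of how identifiers might be recovered from an item link, the sketch below assumes that item URLs take the form `/en/cbes/<SchoolID>/<PageID>/<ItemID>`. That pattern, the regular expression `ITEM_LINK_RE`, and the helper `parse_item_link` are assumptions made for this example and should be checked against real links before use. The IDs in the example call are placeholders.

```python
import re
from typing import Dict, Optional

# ASSUMPTION: item links carry three numeric path segments after /cbes/:
#   /en/cbes/<SchoolID>/<PageID>/<ItemID>
ITEM_LINK_RE = re.compile(
    r"/(?:en|ga)/cbes/(?P<SchoolID>\d+)/(?P<PageID>\d+)/(?P<ItemID>\d+)"
    r"(?:[/?#].*)?$"
)


def parse_item_link(href: str) -> Optional[Dict[str, int]]:
    """Return SchoolID, PageID and ItemID as ints, or None if href is not an item link."""
    m = ITEM_LINK_RE.search(href)
    if not m:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}


# Placeholder IDs, not real records.
ids = parse_item_link("https://www.duchas.ie/en/cbes/111/222/333")
```

School-level links such as `/en/cbes/<SchoolID>?Route=schools` have only one numeric segment, so they return `None` and are not mistaken for items.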


### Section 4: Per-item HTML and XML scraping (`df_items_full`)

In Section 4, we enrich each item in `df_items_index` using its HTML and XML representations.

For each item we:

* fetch the HTML item page and extract:

  * the main display title on the item page (which may differ from the index title)
  * the main transcribed text
  * any other easily accessible visible metadata
* construct the parallel XML URL for the item and fetch it
* parse the XML to extract basic structural and linguistic features, such as:

  * token counts (for example, counts of word elements)
  * language metadata
  * any other TEI fields we choose to include

The result is a dataframe, `df_items_full`, that combines the index-level identifiers with HTML- and XML-derived fields. This is the table you would use for most downstream text analysis, narrative modeling, or linguistic work.
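
The kind of XML-derived features described above can be sketched with the standard library alone. The example assumes that item XML follows TEI conventions, with word tokens in `<w>` elements and language recorded in `xml:lang`. The sample document `SAMPLE_TEI` and the helper `summarize_item_xml` are hypothetical, and the real Dúchas markup may differ.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical TEI fragment used only to exercise the function.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text xml:lang="ga">
    <body><p><w>Bhí</w> <w>fear</w> <w>ann</w></p></body>
  </text>
</TEI>"""


def summarize_item_xml(xml_text: str) -> dict:
    """Count TEI <w> elements and collect the distinct xml:lang values."""
    root = ET.fromstring(xml_text)
    n_words = sum(1 for _ in root.iter(f"{{{TEI_NS}}}w"))
    languages = sorted(
        {el.get(XML_LANG) for el in root.iter() if el.get(XML_LANG)}
    )
    return {"n_word_elements": n_words, "languages": languages}


summary = summarize_item_xml(SAMPLE_TEI)
```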


### Section 5: Request budgeting and benchmarking

In Section 5, we treat the pipeline as an engineering project and ask: how expensive is a full run?

We:

* estimate the number of HTTP requests implied by a full item-level scrape (HTML plus XML per item)
* benchmark the per-item enrichment on a small random subset of items to measure time per item
* project total runtime for the full `df_items_index`
* use these estimates to choose safe throttling parameters and, if necessary, to split the pipeline into batches

This makes the computational cost and network load explicit and helps ensure the pipeline is both reproducible and responsible.
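
The budgeting arithmetic is simple enough to sketch directly. The helper names below are hypothetical, and the item count and benchmark timings are placeholders chosen for round numbers, not measurements from the collection.

```python
def seconds_per_item(benchmark_elapsed_sec: float, n_sampled: int) -> float:
    """Average wall-clock time per item observed in a benchmark run."""
    return benchmark_elapsed_sec / n_sampled


def estimate_full_run(
    n_items: int,
    sec_per_item: float,
    requests_per_item: int = 2,  # one HTML fetch plus one XML fetch per item
) -> dict:
    """Project request volume and runtime for a full item-level scrape."""
    return {
        "total_requests": n_items * requests_per_item,
        "total_hours": n_items * sec_per_item / 3600,
    }


# Placeholder inputs: 50 benchmarked items in 150 s, 100,000 items overall.
per_item = seconds_per_item(benchmark_elapsed_sec=150.0, n_sampled=50)
budget = estimate_full_run(n_items=100_000, sec_per_item=per_item)
```

With these placeholder inputs the projection is 200,000 requests and roughly 83 hours, which is the kind of figure that motivates running the crawl in batches.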


## Summary of core dataframes

By the end of this notebook, the main “products” of the pipeline are:

* `df_schools`: basic school-level metadata from the index crawl
* `df_schools_enriched` and `df_schools_full`: school-level metadata enriched with XML fields and Logainm-derived geography
* `df_items_index`: an index of all items per school, with item URLs and titles
* `df_items_full`: item-level records with text and structured metadata from both HTML and XML

Together, these tables form a coherent, relational dataset for the Irish Schools Collection that is suitable for geospatial analyses, network analyses of narratives or informants, language variation studies, and broader computational and digital humanities research on Irish folklore.


__________
## 1. Schools index crawl

The goal of this section is to construct a school-level dataframe `df_schools` by crawling the public index pages under

`https://www.duchas.ie/en/cbes/schools?Page=<page>&PerPage=20`

Each index page contains a set of "cards" or list items, one per school. From each card we extract:

- a unique school identifier,
- the school URL on Dúchas,
- the school name,
- the CBES volume identifier, and
- the reported percentage of the material that has been transcribed.

This section is deliberately self-contained: it defines helper functions to build index URLs, parse a single index page into a list of rows, and then loop over all pages to build a single dataframe.


In [2]:
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

SCHOOLS_SESSION = requests.Session()
SCHOOLS_SESSION.headers.update(HEADERS)


In [3]:
BASE_URL = "https://www.duchas.ie"
SCHOOLS_INDEX_TEMPLATE = BASE_URL + "/en/cbes/schools?Page={page}&PerPage=20"
SCHOOL_PAGE_RANGE = range(1, 226)  # pages 1–225 (range end is exclusive)


def build_school_index_url(page):
    """
    Build the URL for a single schools index page.

    Parameters
    ----------
    page : int
        Page number as used in the Dúchas schools index.

    Returns
    -------
    str
        Fully-qualified URL for the given index page.
    """
    return SCHOOLS_INDEX_TEMPLATE.format(page=page)


def _clean_spaces(text):
    """
    Normalize whitespace within a text block.

    Collapses runs of whitespace and strips leading and trailing space.
    """
    return re.sub(r"\s+", " ", text or "").strip()


In [None]:
SCHOOL_LINK_RE = re.compile(
    r"^/(?:en|ga)/cbes/(?P<SchoolID>\d+)(?:/)?(?:\?.*)?$",
    re.IGNORECASE,
)

VOLUME_RE = re.compile(
    r"CB(?:E|É)S[\s\-]*([0-9A-Z]{3,5})",
    re.IGNORECASE
)

PCT_RE = re.compile(
    r"(\d{1,3})\s*%"
)

NAME_RE = re.compile(
    r"(?:School|Scoil)\s*:\s*(.*?)\s*CB(?:E|É)S\b",
    re.IGNORECASE,
)

### 1.1 Parsing a single schools index page

Each index page is parsed into a list of dictionaries, one per school card. The code below uses a small set of regular expressions to recognize:

- the school ID from the link URL,
- the CBES volume identifier from the card text, and
- the percent transcribed from the card text.

The HTML structure on Dúchas may change in the future, but the logic here is intended to be explicit and easy to adjust when that happens.
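
To make the behaviour of these patterns concrete, the snippet below applies copies of them to a link and a card string. The link follows the `?Route=schools` form seen in the crawl output, while the card text is a hypothetical reconstruction and may not match the exact wording on the live site.

```python
import re

# Copies of the notebook's patterns so that the snippet runs on its own.
SCHOOL_LINK_RE = re.compile(
    r"^/(?:en|ga)/cbes/(?P<SchoolID>\d+)(?:/)?(?:\?.*)?$", re.IGNORECASE
)
VOLUME_RE = re.compile(r"CB(?:E|É)S[\s\-]*([0-9A-Z]{3,5})", re.IGNORECASE)
PCT_RE = re.compile(r"(\d{1,3})\s*%")
NAME_RE = re.compile(
    r"(?:School|Scoil)\s*:\s*(.*?)\s*CB(?:E|É)S\b", re.IGNORECASE
)

sample_href = "/en/cbes/4606380?Route=schools"
sample_card = "School: Cill Éinne CBES 0001 100%"  # hypothetical card text

school_id = int(SCHOOL_LINK_RE.match(sample_href).group("SchoolID"))
school_name = NAME_RE.search(sample_card).group(1)
volume = VOLUME_RE.search(sample_card).group(1)
percent = int(PCT_RE.search(sample_card).group(1))
```

Note that `PCT_RE` skips the digits of the volume number because none of them is followed by a percent sign, so the match lands on the trailing `100%`.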


In [6]:
def extract_school_cards(list_url):
    """
    Parse a Dúchas schools index page into a list of row dictionaries.

    The page is fetched once with SCHOOLS_SESSION. HTTP errors, including
    429 (Too Many Requests), are raised by raise_for_status().

    Parameters
    ----------
    list_url : str
        URL for a schools index page.

    Returns
    -------
    List[dict]
        One dictionary per school card, with keys 'SchoolID', 'SchoolURL',
        'SchoolName', 'VolumeNumber', 'PercentTranscribed', and 'ListPage'.
    """
    r = SCHOOLS_SESSION.get(list_url, timeout=60)
    r.raise_for_status()
    doc = html.fromstring(r.content)

    rows = []
    seen = set()

    # KEY: use <li> cards, not generic //a
    lis = doc.xpath("//li[.//a[contains(@href,'/cbes/')]]")

    for li in lis:
        anchor = None
        sid = None
        url = None

        # Find the first <a> that matches SCHOOL_LINK_RE
        for a in li.xpath(".//a[@href]"):
            href = a.get("href") or ""
            m = SCHOOL_LINK_RE.match(href)
            if not m:
                continue
            sid = int(m.group("SchoolID"))
            url = href if href.startswith("http") else f"{BASE_URL}{href}"
            anchor = a
            break

        if sid is None or anchor is None or sid in seen:
            continue
        seen.add(sid)

        # Full visible text of LI card
        card_text = _clean_spaces(" ".join(li.xpath(".//text()")))

        # Name extraction: prefer the "School: <name> CBES" pattern in the
        # card text, otherwise fall back to the anchor text before "CBES".
        name = None
        mname = NAME_RE.search(card_text)
        if mname:
            name = _clean_spaces(mname.group(1))
        else:
            atext = _clean_spaces(" ".join(anchor.itertext()))
            name = _clean_spaces(re.split(r"\s*CB(?:E|É)S\b", atext)[0] or atext)

        # Volume number
        vol_num = None
        mvol = VOLUME_RE.search(card_text)
        if mvol:
            vol_num = mvol.group(1)

        # Percent transcribed
        percent = None
        mp = PCT_RE.search(card_text)
        if mp:
            try:
                v = int(mp.group(1))
                if 0 <= v <= 100:
                    percent = v
            except ValueError:
                pass

        rows.append({
            "SchoolID": sid,
            "SchoolURL": url,
            "SchoolName": name or None,
            "VolumeNumber": vol_num,
            "PercentTranscribed": percent,
            "ListPage": list_url,
        })

    return rows


### 1.2 Quick sanity check on a single page

Before crawling all pages, it is helpful to inspect the parsed output for a single index page to confirm that the regular expressions and HTML selectors behave as expected.


In [7]:
test_page = next(iter(SCHOOL_PAGE_RANGE))
test_url = build_school_index_url(test_page)

test_rows = extract_school_cards(test_url)
pd.DataFrame(test_rows).head()

Unnamed: 0,SchoolID,SchoolURL,SchoolName,VolumeNumber,PercentTranscribed,ListPage
0,4606380,https://www.duchas.ie/en/cbes/4606380?Route=sc...,Cill Éinne,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...
1,4606492,https://www.duchas.ie/en/cbes/4606492?Route=sc...,Fearainn an Choirce,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...
2,4613911,https://www.duchas.ie/en/cbes/4613911?Route=sc...,Fearann an Choirce,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...
3,4602668,https://www.duchas.ie/en/cbes/4602668?Route=sc...,Inis Oirthir (Inisheer),1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...
4,4602669,https://www.duchas.ie/en/cbes/4602669?Route=sc...,Breac-chluain,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...


### 1.3 Crawl all schools index pages

The function below loops over all index pages, calls `extract_school_cards` on each, and concatenates the results into a single dataframe, which the following cell assigns to `df_schools`. A short sleep between requests avoids overloading the server, and progress (rows collected and elapsed time) is printed every 10 pages.


In [8]:
def crawl_all_schools(pages, throttle_sec=0.15):
    """
    Crawl all schools index pages and return a combined dataframe.

    Parameters
    ----------
    pages : iterable of int
        Page numbers to crawl.
    throttle_sec : float, optional
        Sleep time between successive HTTP requests, by default 0.15 seconds.

    Returns
    -------
    pd.DataFrame
        Combined, de-duplicated dataframe of schools.
    """
    all_rows: List[Dict] = []

    start_time = time.time()
    for i, page in enumerate(pages, start=1):
        url = build_school_index_url(page)
        rows = extract_school_cards(url)
        all_rows.extend(rows)

        if i % 10 == 0:
            elapsed = time.time() - start_time
            print(f"[{i} pages] last page={page}, total rows so far={len(all_rows)}, elapsed={elapsed:0.1f}s")

        time.sleep(throttle_sec)

    df = pd.DataFrame(all_rows)

    # De-duplicate by SchoolID, keeping the first occurrence.
    if not df.empty:
        df = df.drop_duplicates(subset=["SchoolID"]).reset_index(drop=True)

    return df


In [9]:
# As observed on 2025-11-28: pages 1–225 return schools; 226+ are 404s.
SCHOOL_PAGE_RANGE = range(1, 226)

df_schools = crawl_all_schools(SCHOOL_PAGE_RANGE, throttle_sec=1.3)

print(f"Collected {len(df_schools)} unique schools.")
print(df_schools.shape[0])
df_schools.head()

[10 pages] last page=10, total rows so far=200, elapsed=13.6s
[20 pages] last page=20, total rows so far=400, elapsed=28.0s
[30 pages] last page=30, total rows so far=600, elapsed=42.4s
[40 pages] last page=40, total rows so far=800, elapsed=56.9s
[50 pages] last page=50, total rows so far=1000, elapsed=71.4s
[60 pages] last page=60, total rows so far=1200, elapsed=86.2s
[70 pages] last page=70, total rows so far=1400, elapsed=101.4s
[80 pages] last page=80, total rows so far=1600, elapsed=116.1s
[90 pages] last page=90, total rows so far=1800, elapsed=130.9s
[100 pages] last page=100, total rows so far=2000, elapsed=145.5s
[110 pages] last page=110, total rows so far=2200, elapsed=160.1s
[120 pages] last page=120, total rows so far=2400, elapsed=174.8s
[130 pages] last page=130, total rows so far=2600, elapsed=189.6s
[140 pages] last page=140, total rows so far=2800, elapsed=204.3s
[150 pages] last page=150, total rows so far=3000, elapsed=218.8s
[160 pages] last page=160, total rows 

Unnamed: 0,SchoolID,SchoolURL,SchoolName,VolumeNumber,PercentTranscribed,ListPage
0,4606380,https://www.duchas.ie/en/cbes/4606380?Route=sc...,Cill Éinne,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...
1,4606492,https://www.duchas.ie/en/cbes/4606492?Route=sc...,Fearainn an Choirce,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...
2,4613911,https://www.duchas.ie/en/cbes/4613911?Route=sc...,Fearann an Choirce,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...
3,4602668,https://www.duchas.ie/en/cbes/4602668?Route=sc...,Inis Oirthir (Inisheer),1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...
4,4602669,https://www.duchas.ie/en/cbes/4602669?Route=sc...,Breac-chluain,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...


In [10]:
print(df_schools.shape[0])

4484


__________

### 1.4 Post-processing and derived columns

For the enrichment stage we need two additional columns:

- `SchoolXML`: the XML endpoint for each school on Dúchas.
- `SchoolURLClean`: the school URL without transient query parameters or routes.

These derived columns are added here, and the resulting dataframe is saved to disk as the canonical school index table.
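
For a single URL, the two derivations can be sketched with `urllib.parse`. This is an alternative formulation for illustration, not the notebook's own implementation, and the helper names `strip_query` and `school_xml_url` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def school_xml_url(school_id) -> str:
    """Build the XML endpoint for a school from its numeric identifier."""
    return f"https://www.duchas.ie/xml/cbes/{school_id}"


clean = strip_query("https://www.duchas.ie/en/cbes/4606380?Route=schools")
xml_url = school_xml_url(4606380)
```

Unlike splitting on the first `?`, this version also removes a trailing `#fragment`, which may or may not matter for the URLs found on the index pages.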


In [11]:
def add_school_derived_columns(df):
    """
    Add derived columns needed for later enrichment:

    - SchoolXML: XML endpoint for the school.
    - SchoolURLClean: cleaned version of the HTML URL without query parameters.
    """
    df = df.copy()

    # XML endpoints follow a simple pattern; adjust if needed.
    df["SchoolXML"] = "https://www.duchas.ie/xml/cbes/" + df["SchoolID"].astype(str)

    # Some SchoolURL values may have query parameters such as '?page=' or '?Route=schools'.
    # Here we drop everything after the first '?'.
    df["SchoolURLClean"] = df["SchoolURL"].str.split("?", n=1).str[0]

    return df


df_schools = add_school_derived_columns(df_schools)
print(df_schools.shape)
df_schools.head()


(4484, 8)


Unnamed: 0,SchoolID,SchoolURL,SchoolName,VolumeNumber,PercentTranscribed,ListPage,SchoolXML,SchoolURLClean
0,4606380,https://www.duchas.ie/en/cbes/4606380?Route=sc...,Cill Éinne,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...,https://www.duchas.ie/xml/cbes/4606380,https://www.duchas.ie/en/cbes/4606380
1,4606492,https://www.duchas.ie/en/cbes/4606492?Route=sc...,Fearainn an Choirce,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...,https://www.duchas.ie/xml/cbes/4606492,https://www.duchas.ie/en/cbes/4606492
2,4613911,https://www.duchas.ie/en/cbes/4613911?Route=sc...,Fearann an Choirce,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...,https://www.duchas.ie/xml/cbes/4613911,https://www.duchas.ie/en/cbes/4613911
3,4602668,https://www.duchas.ie/en/cbes/4602668?Route=sc...,Inis Oirthir (Inisheer),1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...,https://www.duchas.ie/xml/cbes/4602668,https://www.duchas.ie/en/cbes/4602668
4,4602669,https://www.duchas.ie/en/cbes/4602669?Route=sc...,Breac-chluain,1,100,https://www.duchas.ie/en/cbes/schools?Page=1&P...,https://www.duchas.ie/xml/cbes/4602669,https://www.duchas.ie/en/cbes/4602669


In [12]:
# Adjust the path to fit your project structure.
output_path_schools = "../data/duchas_schools_index.csv"
df_schools.to_csv(output_path_schools, index=False)
print(f"Wrote schools index to {output_path_schools}")

Wrote schools index to ../data/duchas_schools_index.csv


### Note on the canonical school list

The Schools Collection index was scraped using the original HTML/regex logic (which returns the full known set of 4,484 schools). The result is stored in:

    ../data/duchas_schools_index.csv

This file contains the complete SchoolID universe and is used as the starting point for all subsequent stages (XML enrichment, Logainm lookups, item index, and item-level scraping). It ensures full coverage even if the public Dúchas index pages expose only a partial subset.


____________
## 2. Enrich schools with XML and Logainm metadata

The schools index provides only basic metadata about each school. Dúchas also exposes an XML endpoint for each school which contains additional information, including teacher names, roll numbers, and references to Logainm (<https://www.logainm.ie/en/>) identifiers for places associated with the school.

In this section we construct an enrichment pipeline that:

1. Retrieves the XML document for each school with a retry-aware HTTP client.
2. Extracts a small set of fields from the XML, such as teacher name and roll number.
3. Resolves any Logainm URL associated with the school and, when available, downloads the corresponding Well-Known Text (WKT) geometry for the place.

The result is a new dataframe `df_schools_enriched` that extends `df_schools` with additional columns. The code is written to be robust against transient HTTP errors and to be easy to adapt if the XML schema evolves.


In [25]:
import threading
import concurrent.futures as cf
from functools import lru_cache

from lxml import etree, html
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from typing import Dict, Any, Optional, List


### 2.1 Robust HTTP session and XML helpers

To avoid re-creating HTTP sessions for each request and to make the enrichment robust to transient errors, the code below keeps one `requests.Session` per thread (thread-local), configured with a retry policy, and defines small helper functions for fetching and parsing XML documents.
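
The idea behind retrying with backoff can be shown independently of any HTTP library. The sketch below is a generic exponential schedule and is not claimed to reproduce the exact timing of urllib3's `Retry`. The `fetch` callback, which returns a `(status, body)` pair, and the function `fetch_with_backoff` are hypothetical stand-ins that let the logic run offline.

```python
import time
from typing import Callable, Tuple

RETRY_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 0.4,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url) until it returns 200, doubling the wait after each retryable failure."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRY_STATUSES:
            raise RuntimeError(f"non-retryable HTTP {status} for {url}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.4 s, 0.8 s, 1.6 s, ...
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")


# Offline demo with a fake transport: two 429 responses, then success.
responses = iter([(429, b""), (429, b""), (200, b"<ok/>")])
delays = []
body = fetch_with_backoff(
    lambda u: next(responses), "https://example.org/x", sleep=delays.append
)
```

Injecting `sleep` makes the schedule observable: in the demo, `delays` records the waits instead of actually pausing.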


In [26]:
# Browser-like User-Agent used across all sessions
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

_thread_local = threading.local()

def _build_session() -> requests.Session:
    """
    Build a requests.Session with browser-like headers and retry logic,
    to be used in a thread-local way.
    """
    s = requests.Session()
    s.headers.update(HEADERS)
    try:
        retry = Retry(
            total=5,
            connect=5,
            read=5,
            backoff_factor=0.4,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=frozenset(["GET"]),
            raise_on_status=False,
        )
    except TypeError:
        # For older urllib3 versions
        retry = Retry(
            total=5,
            connect=5,
            read=5,
            backoff_factor=0.4,
            status_forcelist=[429, 500, 502, 503, 504],
            method_whitelist=frozenset(["GET"]),
            raise_on_status=False,
        )

    adapter = HTTPAdapter(max_retries=retry, pool_connections=50, pool_maxsize=50)
    s.mount("https://", adapter)
    s.mount("http://", adapter)
    return s

def _session() -> requests.Session:
    """
    Return a thread-local Session, creating one if necessary.
    """
    if not hasattr(_thread_local, "session"):
        _thread_local.session = _build_session()
    return _thread_local.session


In [27]:
def _fetch_bytes(url: str, timeout: int = 30) -> bytes:
    """
    Fetch raw bytes from a URL using the thread-local session.
    """
    r = _session().get(url, timeout=timeout)
    r.raise_for_status()
    return r.content


def _string_xp(root: etree._Element, xp: str) -> Optional[str]:
    """
    Evaluate an XPath expression against an XML root and return a stripped string,
    or None if not found / empty.
    """
    try:
        s = root.xpath(xp)
        if isinstance(s, str):
            out = s.strip()
        elif not s:
            out = ""
        else:
            # Take the first value
            val = s[0]
            out = val.strip() if isinstance(val, str) else str(val).strip()
        return out or None
    except Exception:
        return None


### 2.2 Logainm WKT helper

Some school XML documents contain references to Logainm place records. The helper below downloads the corresponding HTML page from Logainm and tries to extract a WKT geometry from `data-wkt` attributes that are typically attached to Leaflet map elements. If no such attribute is present, the function returns `None`. This logic is deliberately conservative and can be refined later if needed.
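
The selection rule (prefer a Leaflet container, otherwise take the first element carrying `data-wkt`) can be illustrated offline. The sketch below uses the standard-library `html.parser` rather than lxml so that it runs without third-party packages. The HTML fragment and the names `_DataWktCollector` and `extract_data_wkt` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class _DataWktCollector(HTMLParser):
    """Collect (class tokens, WKT string) for every element with data-wkt."""

    def __init__(self) -> None:
        super().__init__()
        self.found: List[Tuple[List[str], str]] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        wkt = (attr_map.get("data-wkt") or "").strip()
        if wkt:
            classes = (attr_map.get("class") or "").split()
            self.found.append((classes, wkt))


def extract_data_wkt(html_text: str) -> Optional[str]:
    """Return the WKT from a leaflet-container element, else from the first data-wkt element."""
    collector = _DataWktCollector()
    collector.feed(html_text)
    for classes, wkt in collector.found:
        if "leaflet-container" in classes:
            return wkt
    return collector.found[0][1] if collector.found else None


# Hypothetical page fragment, not copied from Logainm.
SAMPLE_HTML = (
    '<div><span data-wkt="POINT (1 2)"></span>'
    '<div class="map leaflet-container" data-wkt="POINT (-9.6 53.1)"></div></div>'
)
wkt = extract_data_wkt(SAMPLE_HTML)
```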


In [31]:
@lru_cache(maxsize=4096)
def get_logainm_wkt(url: str) -> Optional[str]:
    """
    Given a Logainm (or related) page URL, attempt to extract a WKT geometry
    from a data-wkt attribute.
    """
    if not url:
        return None

    try:
        doc = html.fromstring(_fetch_bytes(url))
        # Primary: Leaflet container with data-wkt
        wkt = doc.xpath(
            "string(//*[contains(concat(' ', normalize-space(@class), ' '), "
            "' leaflet-container ')][@data-wkt][1]/@data-wkt)"
        )
        if not wkt:
            # Fallback: first element with data-wkt
            wkt = doc.xpath("string(//*[@data-wkt][1]/@data-wkt)")
        wkt = wkt.strip() if isinstance(wkt, str) else ""
        return wkt or None
    except Exception:
        return None


### 2.3 Enriching a single school

The function `_parse_school_xml_to_row` takes the `SchoolXML` URL of a single school, retrieves the associated XML document, and extracts a flat dictionary of fields. The exact XPath expressions depend on the Dúchas XML schema and can be refined as needed. The main targets are:

- teacher names (if available),
- the default roll number, and
- an associated Logainm URL plus its WKT geometry.
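
The namespace-agnostic attribute lookup used for these fields can be sketched with the standard library. `xml.etree.ElementTree` has no `local-name()` function, so the example strips the namespace from each tag by hand. The sample document, its namespace URI, and the roll number are hypothetical; only the `schoolName`, `schoolLocation`, and `schoolRollNumber` element names are taken from the notebook's XPath expressions.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical school document; the namespace URI is a placeholder.
SAMPLE_XML = """<school xmlns="http://example.org/hypothetical-ns">
  <schoolName default="Cill Éinne"/>
  <schoolLocation county="GA"/>
  <schoolRollNumber default="12345"/>
</school>"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def first_attr(root: ET.Element, element: str, attr: str) -> Optional[str]:
    """Return attr from the first matching element that has it, or None."""
    for el in root.iter():
        if local_name(el.tag) == element and attr in el.attrib:
            return el.attrib[attr].strip() or None
    return None


root = ET.fromstring(SAMPLE_XML)
school_name = first_attr(root, "schoolName", "default")
county = first_attr(root, "schoolLocation", "county")
```

Matching on the local name keeps the lookup working whether or not the document declares a default namespace, which is the same motivation as the `local-name()` tests in the XPath version.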


In [43]:
def _parse_school_xml_to_row(school_xml_url: str) -> Dict[str, Any]:
    """
    Fetch a single SchoolXML and return a flat dict of fields.

    Expected columns include volume metadata, page metadata, school location
    (including Logainm WKT), teacher names, roll number, and some extra
    top-level meta fields adapted to the actual schema (schoolName, county).
    """
    row: Dict[str, Any] = {
        "SchoolXML": school_xml_url,
        # Volume metadata
        "volume_default": None,
        "volume_listingOrder": None,
        "volume_volumeNumber": None,
        "volume_volumeStatus": None,
        # Page metadata
        "page_default": None,
        "page_defaultURL": None,
        # SchoolLocation metadata
        "schoolLocation_default": None,
        "schoolLocation_nameGA": None,
        "schoolLocation_nameEN": None,
        "schoolLocation_lat": None,
        "schoolLocation_lon": None,
        "schoolLocation_county": None,
        "schoolLocation_url": None,
        "schoolLocation_polygon": None,
        # Teacher info
        "teacherName_pretext": None,
        "teacherName_text": None,
        "teacherName_nameKey": None,
        # Roll number
        "schoolRollNumber_default": None,
        # Extra meta from the XML body
        "SchoolNameXML": None,
        "RollNumber_XML": None,
        "VolumeNumberXML": None,
        "County_XML": None,
        "Parish_XML": None,
        "Barony_XML": None,
        "Townland_XML": None,
        # Error, if any
        "error": None,
    }

    try:
        root = etree.fromstring(_fetch_bytes(school_xml_url))

        # Volume info
        row["volume_default"]      = _string_xp(root, "string(//*[local-name()='volume']/@default)")
        row["volume_listingOrder"] = _string_xp(root, "string(//*[local-name()='volume']/@listingOrder)")
        row["volume_volumeNumber"] = _string_xp(root, "string(//*[local-name()='volume']/@volumeNumber)")
        row["volume_volumeStatus"] = _string_xp(root, "string(//*[local-name()='volume']/@volumeStatus)")

        # Page info
        row["page_default"]   = _string_xp(root, "string(//*[local-name()='page']/@default)")
        row["page_defaultURL"] = _string_xp(root, "string(//*[local-name()='page']/@url)")

        # School location
        row["schoolLocation_default"] = _string_xp(
            root, "string(//*[local-name()='schoolLocation']/@default)"
        )
        row["schoolLocation_nameGA"] = _string_xp(
            root, "string(//*[local-name()='schoolLocation']/@nameGA)"
        )
        row["schoolLocation_nameEN"] = _string_xp(
            root, "string(//*[local-name()='schoolLocation']/@nameEN)"
        )
        row["schoolLocation_lat"] = _string_xp(
            root, "string(//*[local-name()='schoolLocation']/@lat)"
        )
        row["schoolLocation_lon"] = _string_xp(
            root, "string(//*[local-name()='schoolLocation']/@lon)"
        )
        row["schoolLocation_county"] = _string_xp(
            root, "string(//*[local-name()='schoolLocation']/@county)"
        )

        loc_url = _string_xp(root, "string(//*[local-name()='schoolLocation']/@url)")
        if loc_url:
            loc_page = loc_url.replace("/xml/", "/en/")
            row["schoolLocation_url"] = loc_page
            row["schoolLocation_polygon"] = get_logainm_wkt(loc_page)

        # Teacher names
        row["teacherName_pretext"] = _string_xp(
            root, "string(//*[local-name()='teacherName']/@pretext)"
        )
        row["teacherName_text"] = _string_xp(
            root, "string(//*[local-name()='teacherName']/@text)"
        )
        row["teacherName_nameKey"] = _string_xp(
            root, "string(//*[local-name()='teacherName']/@nameKey)"
        )

        # Roll number
        row["schoolRollNumber_default"] = _string_xp(
            root, "string(//*[local-name()='schoolRollNumber']/@default)"
        )

        # Extra high-level meta adapted to the actual schema:
        # <schoolName default="Cill Éinne"/>
        row["SchoolNameXML"] = (
            _string_xp(root, "string(//*[local-name()='schoolName']/@default)")
            or _string_xp(root, "string(//*[local-name()='schoolName'])")
        )

        # RollNumber_XML and VolumeNumberXML are duplicates of existing fields
        row["RollNumber_XML"] = row["schoolRollNumber_default"]
        row["VolumeNumberXML"] = row["volume_volumeNumber"]

        # County is stored as an attribute on schoolLocation (e.g., county="GA")
        row["County_XML"] = row["schoolLocation_county"]

        # Parish / Barony / Townland do not appear in this schema snippet;
        # leave them as None for now.
        row["Parish_XML"] = None
        row["Barony_XML"] = None
        row["Townland_XML"] = None

    except Exception as e:
        row["error"] = str(e)

    return row


### 2.4 Enrich all schools

The function `enrich_schools` applies `_parse_school_xml_to_row` to every unique `SchoolXML` URL in `df_schools` and merges the results back onto `df_schools` by `SchoolXML`. The work is dispatched through a `concurrent.futures.ThreadPoolExecutor`; with the default `max_workers=1` the requests run sequentially, and the worker count can be raised cautiously if more throughput is needed.

A small throttle is added between requests to avoid placing excessive load on the Dúchas and Logainm servers.
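
A rough lower bound on the enrichment runtime follows from the throttle alone. The sketch below ignores network latency and the extra Logainm requests, so real runs will take longer. The helper name `min_enrichment_seconds` is hypothetical; its defaults mirror those of `enrich_in_batches`, and it counts one cool-down per batch, as that loop does.

```python
import math


def min_enrichment_seconds(
    n_schools: int,
    throttle_sec: float = 1.0,
    batch_size: int = 250,
    cooldown_sec: float = 120.0,
) -> float:
    """Throttle time plus batch cool-downs for a sequential run (max_workers=1)."""
    n_batches = math.ceil(n_schools / batch_size)
    return n_schools * throttle_sec + n_batches * cooldown_sec


# 4,484 schools: 4484 s of throttling plus 18 cool-downs of 120 s each.
lower_bound_sec = min_enrichment_seconds(4484)
lower_bound_min = lower_bound_sec / 60
```

For the 4,484 schools in the index this gives 6,644 seconds, or a little under two hours, before any time spent waiting on responses.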

In [50]:
def enrich_schools(
    df_schools: pd.DataFrame,
    max_workers: int = 1,
    throttle_sec: float = 0.8,
) -> pd.DataFrame:
    """
    Enrich df_schools using SchoolXML URLs.

    Parameters
    ----------
    df_schools : pd.DataFrame
        Must contain 'SchoolXML'.
    max_workers : int
        Maximum number of worker threads. For politeness, keep this small
        (1–2) for this site.
    throttle_sec : float
        Sleep time after each XML fetch, in seconds.

    Returns
    -------
    pd.DataFrame
        df_schools merged with enrichment columns (one row per SchoolXML).
    """
    urls = list(df_schools["SchoolXML"].dropna().unique())
    out_rows: List[Dict[str, Any]] = []

    def work(u: str) -> Dict[str, Any]:
        r = _parse_school_xml_to_row(u)
        if throttle_sec:
            time.sleep(throttle_sec)
        return r

    start_time = time.time()
    total = len(urls)

    with cf.ThreadPoolExecutor(max_workers=max_workers) as ex:
        for i, row in enumerate(ex.map(work, urls), 1):
            out_rows.append(row)
            if i % 25 == 0 or i == total:
                elapsed = time.time() - start_time
                print(f"...processed {i}/{total} schools (elapsed {elapsed:.1f}s)")

    df_enrich = pd.DataFrame(out_rows)
    # Merge back on SchoolXML (one-to-one)
    df_full = df_schools.merge(df_enrich, on="SchoolXML", how="left")
    return df_full


In [51]:
def enrich_in_batches(
    df_schools: pd.DataFrame,
    batch_size: int = 250,
    max_workers: int = 1,
    throttle_sec: float = 1.0,
) -> pd.DataFrame:
    """
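    Enrich df_schools in batches to be polite to the server.

    Chunks are held in memory and concatenated at the end; nothing is
    written to disk inside this function, so an interrupted run does not
    keep its partial results.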
    """
    enriched_chunks = []
    n = len(df_schools)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        print(f"Processing schools {start}–{end-1} of {n}")

        chunk = df_schools.iloc[start:end].copy()
        enriched_chunk = enrich_schools(
            chunk,
            max_workers=max_workers,
            throttle_sec=throttle_sec,
        )
        enriched_chunks.append(enriched_chunk)

        # Cool-down between batches (skipped after the final batch)
        if end < n:
            time.sleep(120)

    return pd.concat(enriched_chunks, ignore_index=True)


In [52]:
# df_schools loaded from CSV and has SchoolXML column
df_schools = pd.read_csv("../data/duchas_schools_index.csv")
df_schools["SchoolID"] = df_schools["SchoolID"].astype(str)

In [115]:
df_schools_full = enrich_in_batches(
    df_schools,
    batch_size=250,
    max_workers=1,    # keep this at 1 for now
    throttle_sec=2.0, # can relax to 0.8 later if things look stable
)

print("Enriched schools:", len(df_schools_full))
df_schools_full.head()


Once the enrichment has completed, `df_schools_full` already contains the original `df_schools` columns alongside the enrichment columns (the merge happens inside `enrich_schools`), so it can be written to disk directly as a single canonical table.
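
Because the batch loop above keeps every chunk in memory until the end, a long run that is interrupted has to start over. One possible way to make batches resumable is sketched below. It works on plain lists of dictionaries with standard-library JSON files rather than on dataframes, and the function name, the `batch_NNNNNN.json` file naming, and the `fake_enrich` stand-in are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_batches_with_checkpoints(
    records: List[Dict],
    enrich_batch: Callable[[List[Dict]], List[Dict]],
    checkpoint_dir: Path,
    batch_size: int = 250,
) -> List[Dict]:
    """Process records in batches, writing one JSON checkpoint per batch.

    A batch whose checkpoint file already exists is loaded from disk
    instead of being recomputed.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    out: List[Dict] = []
    for start in range(0, len(records), batch_size):
        path = checkpoint_dir / f"batch_{start:06d}.json"
        if path.exists():
            batch_out = json.loads(path.read_text(encoding="utf-8"))
        else:
            batch_out = enrich_batch(records[start:start + batch_size])
            path.write_text(
                json.dumps(batch_out, ensure_ascii=False), encoding="utf-8"
            )
        out.extend(batch_out)
    return out


# Demo with a stand-in enrichment function that records each call.
calls: List[int] = []


def fake_enrich(batch: List[Dict]) -> List[Dict]:
    calls.append(len(batch))
    return [{**rec, "enriched": True} for rec in batch]


demo_records = [{"SchoolID": str(i)} for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    first = run_batches_with_checkpoints(demo_records, fake_enrich, Path(tmp), batch_size=2)
    # The second run finds all three checkpoints and does no new work.
    second = run_batches_with_checkpoints(demo_records, fake_enrich, Path(tmp), batch_size=2)

print("enrich calls:", calls)
```

Checkpoints are keyed by the batch start offset, so this sketch assumes the input order and `batch_size` stay the same between runs; changing either would require clearing the checkpoint directory first.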


In [122]:
df_schools_full.to_csv("../data/duchas_schools_enriched.csv", index=False)
print("Wrote enriched schools table to ../data/duchas_schools_enriched.csv")

Wrote enriched schools table to ../data/duchas_schools_enriched.csv


___________
## 3. Per-school item titles

In this section, the goal is to move from a school-level dataset to an item-level index of stories, essays, and other materials for each school.

For each school, Dúchas exposes pages that list individual items, with links to the item view. Here we treat the school’s main CBES page as the entry point for collecting item links. We parse these pages, identify links to individual items, and construct a dataframe `df_items_index` with one row per item, containing (at minimum) school ID, page ID, item ID, item URL, and item title.

The exact HTML structure on Dúchas can change, so the XPath and URL patterns in this section should be considered a documented starting point that can be updated if the site evolves.
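
As a quick sanity check of the URL pattern used in this section, the sketch below runs the same regular expression over a few hrefs. The hrefs are invented for illustration and are not taken from the live site.

```python
import re

ITEM_LINK_RE = re.compile(
    r"/en/cbes/(?P<SchoolID>\d+)/(?P<PageID>\d+)(?:/(?P<ItemID>\d+))?"
)

# Hypothetical hrefs, invented for illustration only.
sample_hrefs = [
    "/en/cbes/1234567/7654321/9999999",  # school / page / item
    "/en/cbes/1234567/7654321",          # school / page, no item segment
    "/en/cbes/1234567",                  # school only: does not match
    "/en/about",                         # unrelated link: does not match
]

parsed = []
for href in sample_hrefs:
    m = ITEM_LINK_RE.search(href)
    parsed.append(m.groupdict() if m else None)

for href, groups in zip(sample_hrefs, parsed):
    print(href, "->", groups)
```

Because the item segment is optional in the pattern, page-level links also match and produce rows with `ItemID` set to `None`. If only item-level rows are wanted, those can be filtered out after collection.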


In [None]:
import concurrent.futures

# Item URLs on Dúchas CBES typically look something like:
#   /en/cbes/<SchoolID>/<PageID>/<ItemID>
# or a closely related pattern. The regex below captures those IDs.
ITEM_LINK_RE = re.compile(
    r"/en/cbes/(?P<SchoolID>\d+)/(?P<PageID>\d+)(?:/(?P<ItemID>\d+))?"
)


def build_school_items_url(row: pd.Series) -> str:
    """
    Construct the URL that lists items for a given school.

    At present we use the cleaned school URL itself as the entry point,
    assuming that the main CBES page contains links to all items associated
    with the school (possibly with pagination).

    If Dúchas provides a dedicated 'Titles/Teidil' view or uses specific
    query parameters, adjust this function to return the appropriate URL.
    """
    return row["SchoolURLClean"]


In [None]:
def extract_items_for_school(row: pd.Series) -> List[Dict]:
    """
    Extract item links for a single school from its Dúchas CBES page.

    Parameters
    ----------
    row : pd.Series
        School row with at least 'SchoolID', 'SchoolName', and 'SchoolURLClean'.

    Returns
    -------
    List[dict]
        List of item dictionaries with fields such as:
        SchoolID, PageID, ItemID, ItemURL, ItemTitle, SchoolName.
    """
    school_id = str(row["SchoolID"])
    school_name = row.get("SchoolName")
    url = build_school_items_url(row)

    try:
        resp = ENRICH_SESSION.get(url, timeout=30)
        resp.raise_for_status()
    except Exception as exc:
        print(f"[WARN] Failed to fetch items page for school {school_id} at {url}: {exc}")
        return []

    tree = html.fromstring(resp.content)

    # Strategy: look for anchors that match the item URL pattern and treat
    # the anchor text as the item title. You may want to tighten this by
    # restricting to specific containers if the page has many links.
    anchors = tree.xpath("//a[contains(@href, '/en/cbes/')]")
    rows: List[Dict] = []

    for a in anchors:
        href = a.get("href", "")
        m = ITEM_LINK_RE.search(href)
        if not m:
            continue

        item_school_id = m.group("SchoolID")
        page_id = m.group("PageID")
        item_id = m.group("ItemID")

        # Only keep items that belong to this school.
        if item_school_id != school_id:
            continue

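        # Hrefs may be relative or absolute; only prefix the relative ones.
        item_url = href if href.startswith(("http://", "https://")) else BASE_URL + href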
        title_text = _clean_spaces(" ".join(a.itertext()))

        rows.append(
            {
                "SchoolID": item_school_id,
                "PageID": page_id,
                "ItemID": item_id,
                "ItemURL": item_url,
                "ItemTitle": title_text or None,
                "SchoolName": school_name,
            }
        )

    return rows


### 3.1 Collecting items for all schools

The function below applies `extract_items_for_school` to every school in `df_schools_full`. For transparency and ease of debugging, the implementation uses a simple thread pool over schools. The `throttle_sec` delay is applied between task submissions, so it caps how fast new tasks enter the pool (at most `1 / throttle_sec` per second) rather than spacing out the requests made by each worker.

If Dúchas paginates item lists across multiple pages for a school, the current implementation will only capture item links on the main CBES page. In that case the `extract_items_for_school` function can be extended to follow additional pages.
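
One way such an extension could look is sketched below. How Dúchas actually paginates (if it does) is not established here, so the sketch takes a caller-supplied `fetch_page(page_no)` function, which is hypothetical, and simply stops when a page contributes no new items or when `max_pages` is reached.

```python
from typing import Callable, Dict, List, Set, Tuple


def collect_paginated(
    fetch_page: Callable[[int], List[Dict]],
    max_pages: int = 50,
) -> List[Dict]:
    """Collect items page by page until a page adds nothing new."""
    seen: Set[Tuple] = set()
    items: List[Dict] = []
    for page_no in range(1, max_pages + 1):
        new_on_page = 0
        for item in fetch_page(page_no):
            key = (item.get("SchoolID"), item.get("PageID"), item.get("ItemID"))
            if key not in seen:
                seen.add(key)
                items.append(item)
                new_on_page += 1
        if new_on_page == 0:
            break  # empty or fully repeated page: assume the list has ended
    return items


# Demo with in-memory "pages" (invented data, no network access).
fake_pages = {
    1: [{"SchoolID": "1", "PageID": "10", "ItemID": "100"},
        {"SchoolID": "1", "PageID": "10", "ItemID": "101"}],
    2: [{"SchoolID": "1", "PageID": "10", "ItemID": "101"},  # repeat
        {"SchoolID": "1", "PageID": "11", "ItemID": "102"}],
}
requested: List[int] = []


def fake_fetch_page(page_no: int) -> List[Dict]:
    requested.append(page_no)
    return fake_pages.get(page_no, [])


paged_items = collect_paginated(fake_fetch_page)
print(len(paged_items), "items from pages", requested)
```

The stop condition costs one extra request per school (the first page with nothing new), and `max_pages` guards against a site that keeps returning the same content for any page number.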


In [None]:
def collect_all_items(df_schools_base, max_workers=8, throttle_sec=0.05):
    """
    Collect item links for all schools in the dataframe.

    Parameters
    ----------
    df_schools_base : pd.DataFrame
        Dataframe of schools. Must contain 'SchoolID', 'SchoolName',
        and 'SchoolURLClean'.
    max_workers : int
        Maximum number of worker threads used to fetch school pages.
    throttle_sec : float
        Delay between submissions of tasks, as a light throttle.

    Returns
    -------
    pd.DataFrame
        Dataframe with one row per item (SchoolID, PageID, ItemID, ItemURL,
        ItemTitle, SchoolName), de-duplicated by (SchoolID, PageID, ItemID).
    """
    records: List[Dict] = []

    def _worker(row_tuple):
        idx, row = row_tuple
        return extract_items_for_school(row)

    school_rows = list(df_schools_base.reset_index(drop=True).iterrows())
    start_time = time.time()

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for i, row_tuple in enumerate(school_rows, start=1):
            fut = pool.submit(_worker, row_tuple)
            futures[fut] = row_tuple[0]  # index of the school row

            time.sleep(throttle_sec)

        for j, fut in enumerate(concurrent.futures.as_completed(futures), start=1):
            idx = futures[fut]
            try:
                rows = fut.result()
                records.extend(rows)
            except Exception as exc:
                print(f"[WARN] items worker failed for school index {idx}: {exc}")

            if j % 50 == 0:
                elapsed = time.time() - start_time
                print(f"[{j} schools processed] total items so far={len(records)}, elapsed={elapsed:0.1f}s")

    df_items = pd.DataFrame(records)
    if not df_items.empty:
        df_items = df_items.drop_duplicates(subset=["SchoolID", "PageID", "ItemID"]).reset_index(drop=True)
    return df_items


df_items_index = collect_all_items(df_schools_full, max_workers=8, throttle_sec=0.05)
print(f"Collected {len(df_items_index)} unique items.")
df_items_index.head()


In [None]:
# Optional: save the item index
output_path_items_index = "../data/duchas_items_index.csv"
df_items_index.to_csv(output_path_items_index, index=False)
print(f"Wrote items index to {output_path_items_index}")


__________
## 4. Per-item HTML and XML scraper

With an item index in hand, the next step is to enrich each item with more detailed metadata. Dúchas typically provides an HTML view of the item and a corresponding XML representation that includes structural and linguistic information.

In this section we construct a per-item parser that:

1. Fetches the HTML view of an item and extracts its main text and any relevant metadata that are convenient to read from HTML.
2. Builds the corresponding XML URL for the item, fetches it, and extracts machine-readable fields that are only present in XML (for example counts of tokens, languages, or structural annotation).
3. Returns a dictionary for each item that can be assembled into a dataframe `df_items_full`.

The exact XPaths depend on the Dúchas HTML and XML schema; the focus here is to provide a clear, easily modifiable skeleton rather than hard-coding every detail.
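
One detail worth checking early is whether the item XML declares a default namespace, since plain element names then stop matching. The sketch below shows the effect on an invented TEI-like snippet (the real Dúchas schema may differ). It uses the standard library's `xml.etree.ElementTree` with the `{*}` wildcard, available from Python 3.8, instead of lxml; the lxml counterpart is the `local-name()` form already used for the school-level XML.

```python
import xml.etree.ElementTree as ET

# Invented TEI-like snippet; the real Dúchas schema may differ.
SAMPLE_XML = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text xml:lang="ga">
    <body><p><w>Bhí</w> <w>fear</w> <w>ann</w></p></body>
  </text>
</TEI>"""

# ElementTree exposes xml:lang under the fixed XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(SAMPLE_XML)

# A bare name only matches elements in no namespace.
plain_count = len(root.findall(".//w"))
# The {*} wildcard matches <w> in any namespace (or none).
wildcard_count = len(root.findall(".//{*}w"))

# First xml:lang value found in document order, if any.
lang = next(
    (el.get(XML_LANG) for el in root.iter() if el.get(XML_LANG)),
    None,
)

print(plain_count, wildcard_count, lang)
```

A token count of zero on a document that visibly contains text is therefore more likely a namespace mismatch than an empty item.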


In [None]:
from urllib.parse import urlparse


def build_item_xml_url(item_url):
    """
    Construct an XML URL for an item given its HTML URL.

    For CBES items, the XML URL is often parallel to the HTML URL, with the
    path switched from '/en/cbes/...' to '/xml/cbes/...'. This function
    applies that transformation.

    Adjust if Dúchas uses a different convention.
    """
    parsed = urlparse(item_url)
    # Replace '/en/cbes/' with '/xml/cbes/' in the path.
    xml_path = parsed.path.replace("/en/cbes/", "/xml/cbes/")
    return f"{parsed.scheme}://{parsed.netloc}{xml_path}"


def parse_item_html(item_url):
    """
    Fetch and parse the HTML representation of an item.

    Returns a dictionary with HTML-derived fields such as the visible title
    and main text. XPaths are intentionally conservative and may be adjusted
    to match the actual site structure.
    """
    try:
        resp = ENRICH_SESSION.get(item_url, timeout=30)
        resp.raise_for_status()
    except Exception as exc:
        print(f"[WARN] Failed to fetch item HTML at {item_url}: {exc}")
        return {
            "ItemHTMLTitle": None,
            "ItemHTMLText": None,
        }

    tree = html.fromstring(resp.content)

    # Title: use the main heading on the item page, if present.
    title_nodes = tree.xpath("//h1 | //h2")
    html_title = None
    if title_nodes:
        html_title = _clean_spaces(" ".join(title_nodes[0].itertext()))

    # Main text: many Dúchas pages put item text in a main content div.
    # This is a placeholder XPath; adjust to match the actual class / id.
    text_nodes = tree.xpath("//div[contains(@class, 'transcription') or contains(@class, 'content')]")
    if text_nodes:
        html_text = _clean_spaces(" ".join(text_nodes[0].itertext()))
    else:
        html_text = None

    return {
        "ItemHTMLTitle": html_title,
        "ItemHTMLText": html_text,
    }


In [None]:
def parse_item_xml(item_xml_url):
    """
    Fetch and parse the XML representation of an item.

    Returns a dictionary with XML-derived fields. The concrete XPaths used
    here are placeholders and should be adapted once the XML schema is
    inspected in more detail.
    """
    xml_root = fetch_xml(item_xml_url)
    if xml_root is None:
        return {
            "ItemXMLAvailable": False,
            "ItemTokenCount": None,
            "ItemLanguage": None,
        }

    # Example: count word-like elements (e.g. <w> nodes). local-name() is
    # used so the match still works if the document declares a namespace.
    tokens = xml_root.xpath("//*[local-name()='w']")
    token_count = len(tokens)

    # Example: guess language from an attribute or element.
    # Replace this with the actual path once known.
    lang = (
        _string_xp(xml_root, "string(//*[local-name()='language'])")
        or _string_xp(xml_root, "string(//@xml:lang)")
    )

    return {
        "ItemXMLAvailable": True,
        "ItemTokenCount": token_count,
        "ItemLanguage": lang,
    }


In [None]:
def parse_item_record(row: pd.Series) -> Dict:
    """
    Enrich a single item record from df_items_index.

    Parameters
    ----------
    row : pd.Series
        Row with at least 'SchoolID', 'PageID', 'ItemID', 'ItemURL', and 'ItemTitle'.

    Returns
    -------
    dict
        Dictionary containing the original IDs and URL, plus HTML- and XML-
        derived fields.
    """
    item_url = row["ItemURL"]
    item_xml_url = build_item_xml_url(item_url)

    base = {
        "SchoolID": row["SchoolID"],
        "PageID": row["PageID"],
        "ItemID": row["ItemID"],
        "ItemURL": item_url,
        "ItemXMLURL": item_xml_url,
        "ItemTitleIndex": row.get("ItemTitle"),
        "SchoolName": row.get("SchoolName"),
    }

    html_fields = parse_item_html(item_url)
    xml_fields = parse_item_xml(item_xml_url)

    out = base.copy()
    out.update(html_fields)
    out.update(xml_fields)
    return out



### 4.1 Enrich all items

The function below applies `parse_item_record` to every row in `df_items_index` and returns an enriched dataframe `df_items_full`. For clarity it uses a simple thread pool and a small throttle; this can be adjusted depending on how aggressively you want to crawl.

Because per-item scraping is more intensive than school-level or index-level requests, it is especially important to monitor runtime and request volume, which is the focus of Section 5.


In [None]:
def enrich_all_items(df_items_base, max_workers=8, throttle_sec=0.02):
    """
    Enrich all items in df_items_index by applying parse_item_record.

    Parameters
    ----------
    df_items_base : pd.DataFrame
        Base items dataframe with one row per item.
    max_workers : int
        Maximum number of worker threads.
    throttle_sec : float
        Delay between task submissions as a light throttle.

    Returns
    -------
    pd.DataFrame
        Enriched items dataframe.
    """
    records: List[Dict] = []

    def _worker(row_tuple):
        idx, row = row_tuple
        return parse_item_record(row)

    item_rows = list(df_items_base.reset_index(drop=True).iterrows())
    start_time = time.time()

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for i, row_tuple in enumerate(item_rows, start=1):
            fut = pool.submit(_worker, row_tuple)
            futures[fut] = row_tuple[0]

            time.sleep(throttle_sec)

        for j, fut in enumerate(concurrent.futures.as_completed(futures), start=1):
            idx = futures[fut]
            try:
                rec = fut.result()
                records.append(rec)
            except Exception as exc:
                print(f"[WARN] item worker failed for item index {idx}: {exc}")

            if j % 100 == 0:
                elapsed = time.time() - start_time
                print(f"[{j} items processed] elapsed={elapsed:0.1f}s")

    df_items_full = pd.DataFrame(records)
    # De-duplicate just in case.
    if not df_items_full.empty:
        df_items_full = df_items_full.drop_duplicates(
            subset=["SchoolID", "PageID", "ItemID"]
        ).reset_index(drop=True)
    return df_items_full


# Example: start with a small subset until you are happy with the parsing.
df_items_full_sample = enrich_all_items(df_items_index.head(50), max_workers=8, throttle_sec=0.02)
df_items_full_sample.head()


In [None]:
# # Once satisfied, run on the full items index.
# # This can take time depending on the size of df_items_index.

# df_items_full = enrich_all_items(df_items_index, max_workers=8, throttle_sec=0.02)
# output_path_items_full = "../data/duchas_items_full.csv"
# df_items_full.to_csv(output_path_items_full, index=False)
# print(f"Wrote full items table to {output_path_items_full}")


______________

## 5. Request budgeting and benchmarking

The full pipeline involves a substantial number of HTTP requests: index pages for schools, per-school item lists, and per-item HTML and XML documents.

Before running the entire per-item enrichment, it is helpful to estimate the total number of requests and to benchmark the runtime on a small sample of items. This section provides simple helpers for both tasks.

The goal is to make the cost of a full run explicit, so it is easier to decide on throttling parameters and whether to split the crawl into smaller batches.
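
As a rough illustration of that arithmetic, the sketch below projects crawl time from a handful of inputs. It assumes two requests per item (as in the current design), perfectly even parallelism across workers, and that the submission throttle acts as a lower bound on total time. The function name and all numbers in the example are hypothetical; a measured benchmark should replace them.

```python
def project_crawl_hours(
    n_items: int,
    requests_per_item: int = 2,
    seconds_per_request: float = 1.0,
    max_workers: int = 1,
    throttle_sec: float = 0.0,
) -> float:
    """Back-of-envelope projection of crawl time in hours."""
    n_requests = n_items * requests_per_item
    # Time spent on requests if the workers share the load evenly.
    work_seconds = n_requests * seconds_per_request / max_workers
    # Tasks are submitted one at a time with a pause after each.
    throttle_seconds = n_items * throttle_sec
    # Whichever is larger dominates the run.
    return max(work_seconds, throttle_seconds) / 3600.0


# Hypothetical inputs, not measurements.
hours_fast = project_crawl_hours(
    100_000, seconds_per_request=0.5, max_workers=8, throttle_sec=0.02
)
hours_polite = project_crawl_hours(
    100_000, seconds_per_request=0.5, max_workers=1, throttle_sec=1.0
)
print(f"8 workers: ~{hours_fast:.1f} h; 1 worker: ~{hours_polite:.1f} h")
```

The gap between the two settings is the trade-off this section is meant to make explicit: a gentler crawl is entirely feasible, but it needs to be planned as a multi-hour or multi-day job, ideally split into batches.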


In [None]:
def estimate_item_request_counts(df_items_base):
    """
    Estimate the number of HTTP requests needed for per-item enrichment.

    For each item, the current pipeline makes:
      - one HTML request (item page),
      - one XML request (item XML).

    Additional requests (for example, page-level XML shared across items)
    can be added later if needed.

    Parameters
    ----------
    df_items_base : pd.DataFrame
        Items index dataframe.

    Returns
    -------
    dict
        Dictionary with counts of estimated requests.
    """
    n_items = len(df_items_base)

    # One HTML and one XML request per item in the current design.
    n_html = n_items
    n_xml = n_items

    total = n_html + n_xml

    return {
        "n_items": n_items,
        "n_item_html_requests": n_html,
        "n_item_xml_requests": n_xml,
        "n_total_item_requests": total,
    }


est_counts = estimate_item_request_counts(df_items_index)
est_counts


In [None]:
def benchmark_item_enrichment(df_items_base, sample_size=50, max_workers=8, throttle_sec=0.02):
    """
    Benchmark the per-item enrichment on a small random sample.

    Parameters
    ----------
    df_items_base : pd.DataFrame
        Items index dataframe.
    sample_size : int
        Number of items to sample for the benchmark.
    max_workers : int
        Maximum number of worker threads.
    throttle_sec : float
        Delay between task submissions.

    Returns
    -------
    dict
        Benchmark statistics including time per item and projected total time.
    """
    if len(df_items_base) == 0:
        raise ValueError("df_items_base is empty; nothing to benchmark.")

    sample = df_items_base.sample(
        n=min(sample_size, len(df_items_base)),
        random_state=42
    ).reset_index(drop=True)

    start_time = time.time()
    _ = enrich_all_items(sample, max_workers=max_workers, throttle_sec=throttle_sec)
    elapsed = time.time() - start_time

    n_sample = len(sample)
    time_per_item = elapsed / n_sample
    projected_total = time_per_item * len(df_items_base)

    return {
        "sample_size": n_sample,
        "elapsed_seconds": elapsed,
        "time_per_item_seconds": time_per_item,
        "projected_total_seconds": projected_total,
        "projected_total_hours": projected_total / 3600.0,
    }


benchmark_stats = benchmark_item_enrichment(
    df_items_index,
    sample_size=50,
    max_workers=8,
    throttle_sec=0.02,
)
benchmark_stats
