<a href="https://colab.research.google.com/github/mialondon/llm-lod-enriching-heritage/blob/main/data_preparation_WIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing a Cultural Heritage Dataset for Named Entity Recognition

## Rationale

This recipe shows how a cultural heritage (CH) researcher can start from common types of data and preprocess them so that they can be fed into a named entity recognition (NER) pipeline.

It starts from a set of artifact descriptions retrieved from a museum API, here that of the Cleveland Museum of Art, and walks through the steps needed to process the results and format them as a JSON file that can then be passed to an NER process.

The JSON files can be generated starting from an ID search (individual numerical records for artifacts) or a keyword search (e.g. "Manet").

# Overview of the process

1. **Fetch input**  
   - Get artworks from the Cleveland Museum of Art API (by ID or keyword), or use local text examples.  

2. **Clean text**  
   - Normalize Unicode, remove control characters, collapse spaces, mask URLs/emails.  

3. **Detect language**  
   - Use `langid.py` to assign a language code and a score (by default an unnormalized log-probability, not a 0 to 1 confidence).  

4. **Tokenize & split sentences**  
   - Use spaCy's lightweight tokenizer and sentencizer to create tokens and sentence spans.  

5. **Assemble JSON**  
   - Combine results into a structured record:  
     - `text_original` and `text_clean`  
     - `language` (code + score)  
     - `sentences` (spans + text)  
     - `tokens` (spans + features)  
     - `meta` (counts + IDs)

# Step 1: Installation of the necessary packages and libraries

In [None]:
%pip install spacy langid

Collecting langid
  Downloading langid-1.1.6.tar.gz (1.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 36.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langid
  Building wheel for langid (setup.py) ... done
  Created wheel for langid: filename=langid-1.1.6-py3-none-any.whl size=1941171 sha256=1395d434614e8a0355587e8c8ac7fa43ea341e719c3616601062a94545259bc8
  Stored in directory: /root/.cache/pip/wheels/3c/bc/9d/266e27289b9019680d65d9b608c37bff1eff565b001c977ec5
Successfully built langid
Installing collected packages: langid
Successfully installed langid-1.1.6


In [None]:
# pip install requests spacy langid

import re, unicodedata, json
from typing import List, Dict, Any, Tuple

import requests
import spacy
import langid

# Step 2: Fetch Input from a Museum API

Get artworks from the Cleveland Museum of Art API (by ID or keyword), or use local text examples.

In [None]:
# ---------------------------
# 1) CMA API (single function)
# ---------------------------
def fetch_cma(source: str, *, mode: str = "search", limit: int = 100) -> List[Dict[str, Any]]:  # limit caps the number of search results (default: 100)
    """
    mode='id'     -> source is an artwork id
    mode='search' -> source is a keyword query
    Returns: list of {'id': <id>, 'text': <description or ''>}
    """
    base = "https://openaccess-api.clevelandart.org/api/artworks"

    if mode == "id":
        url = f"{base}/{source}"
        r = requests.get(url, timeout=20)
        if r.status_code == 404:
            # Graceful return instead of raising
            return []
        r.raise_for_status()
        data = r.json().get("data", {})
        return [{"id": data.get("id"), "text": (data.get("description") or "").strip()}]

    # search mode
    params = {"q": source, "skip": 0, "limit": limit}
    r = requests.get(base, params=params, timeout=30)
    r.raise_for_status()
    out = []
    for a in r.json().get("data", []):
        out.append({"id": a.get("id"), "text": (a.get("description") or "").strip()})
    return out

The code in the cells below will display a preview of the results for each function.

In [None]:
# test the fetch_cma function and print an overview of the results, using the "search" mode and a keyword, e.g. "monet"
records = fetch_cma("monet", mode="search")
if not records:
    print("CMA search: no records returned for this query.")

print(records[:10])

[{'id': 135382, 'text': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green."}, {'id': 136510, 'text': 'A skilled horticulturalist as well as an artist, Claude Monet spent the last 30 years of his life painting the private garden he designed and helped cultivate at his home in Giverny in northern France. The resultant canvases are notable for their varied motifs, formats, and sizes. Monumental in scale, this rendering of his water lily pond focuses on the momentary effects of sunlight as it both penetrates and reflects off its shimmering surface. By zeroing in on the water and omitting its horizon and surrounding banks, Monet infers a limitless expanse—a perception amplified by the painting’s vast horizontal format th

In [None]:
# test the fetch_cma function and print an overview of the results, using the "id" mode and a sample record id.
records = fetch_cma("135382", mode="id")
if not records:
    print("CMA by id: no record (likely 404).")

print(records[:10])

[{'id': 135382, 'text': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green."}]
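
`fetch_cma` always sends `skip=0`, so a search returns at most the first `limit` hits. The sketch below shows one way to build the parameters for successive pages. It assumes the API treats `skip` as a result offset (the notebook passes it but never varies it), and `page_params` is a hypothetical helper, not part of the notebook or the API.

```python
from typing import Dict, Iterator, Union


def page_params(query: str, total: int, page_size: int = 100) -> Iterator[Dict[str, Union[str, int]]]:
    """Yield one request-parameter dict per page needed to cover `total` results."""
    for skip in range(0, total, page_size):
        # The last page may be shorter than page_size
        yield {"q": query, "skip": skip, "limit": min(page_size, total - skip)}


# Example: parameters for the first 250 hits, 100 per request
pages = list(page_params("monet", total=250, page_size=100))
for p in pages:
    print(p)
```

Each dict could be passed as `params=` to `requests.get(base, ...)` in place of the fixed dict used in `fetch_cma`.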



# Step 3: Clean text

Normalize Unicode, remove control characters, collapse spaces, mask URLs/emails.  

In [None]:
# ---------------------------
# 2) Pre-processing helpers
# ---------------------------
def clean_text(text: str) -> str:
    t = unicodedata.normalize("NFC", text)
    t = re.sub(r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]", "", t)  # strip control chars
    t = re.sub(r"[ \t\u00A0]+", " ", t)                                   # collapse spaces
    t = re.sub(r"https?://\S+", "<URL>", t)                               # mask URLs
    t = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<EMAIL>", t)               # mask emails
    return t

In [None]:
# call the function clean text on the records output from the previous function, and print the first results.
clean_records = [clean_text(r["text"]) for r in records]
print(clean_records[:10])

["This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green."]
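
The Monet description is already clean, so the output above is identical to the input. The following sketch runs the same steps on a deliberately messy, made-up string to show what each substitution does. `clean_text` is copied from the cell above so that the block runs on its own.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    # Copy of the notebook's clean_text, repeated here so the example is self-contained
    t = unicodedata.normalize("NFC", text)
    t = re.sub(r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]", "", t)  # strip control chars
    t = re.sub(r"[ \t\u00A0]+", " ", t)                                   # collapse spaces
    t = re.sub(r"https?://\S+", "<URL>", t)                               # mask URLs
    t = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<EMAIL>", t)               # mask emails
    return t


# Made-up input: combining accent, non-breaking space, BEL control char, URL, email
messy = "Cafe\u0301\u00A0 Monet\u0007: see https://example.org/art  or write to info@example.org!"
cleaned = clean_text(messy)
print(cleaned)

# Caveat: the URL pattern is greedy, so punctuation directly after a URL is masked too
swallowed = clean_text("See https://example.org.")
print(swallowed)
```

Note that newlines are neither removed nor collapsed by these patterns, so paragraph breaks in longer texts survive cleaning.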


Then, detect the language of each text with the `langid` library (a standalone package, not part of spaCy). The detected language can help choose an appropriate sentencizer for the following step.

In [None]:
# detect language using langid

def detect_language(text: str) -> Dict[str, Any]:
    if not text:
        return {"language": "xx", "score": None}
    code, score = langid.classify(text)  # score = log-likelihood
    return {"language": code, "score": float(score)}

In [None]:
# call the function detect_language and print the language id for the records output
lang_id = [detect_language(r) for r in clean_records]
print(lang_id[:10])

[{'language': 'en', 'score': -868.9007034301758}]
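
The score above is a large negative number because it is an unnormalized log-probability, which is hard to compare across texts of different lengths. As a rough sketch, log scores for several candidate languages can be turned into probabilities with a softmax. The helper `normalize_scores` is hypothetical, and the input is assumed to be a list of `(language, log_score)` pairs such as the one `langid.rank(text)` is documented to return; the numbers below are made up for illustration.

```python
import math
from typing import List, Tuple


def normalize_scores(ranked: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Convert (language, log_score) pairs into (language, probability) pairs."""
    top = max(score for _, score in ranked)
    # Subtract the maximum before exponentiating to avoid underflow on very negative scores
    weights = [(lang, math.exp(score - top)) for lang, score in ranked]
    total = sum(w for _, w in weights)
    return [(lang, w / total) for lang, w in weights]


# Illustrative scores only, not real langid output
example = [("en", -868.9), ("nl", -880.0), ("de", -900.0)]
probs = normalize_scores(example)
for lang, p in probs:
    print(f"{lang}: {p:.6f}")
```

The langid README also describes a `norm_probs` option on `LanguageIdentifier` that yields normalized scores directly, which may be simpler if only the top language is needed.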


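# Step 4: Preprocess the texts

Using the spaCy sentencizer on a blank multilingual (`xx`) pipeline, split each text into sentences and tokens. The language detected by langid is stored in the record but does not change which pipeline is used.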

Then, assemble the structured record as a JSON output.

- Output fields:

     - `text_original` and `text_clean`  
     - `language` (code + score)  
     - `sentences` (spans + text)  
     - `tokens` (spans + features)  
     - `meta` (counts + IDs)
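
Because sentences and tokens are stored as character spans, a quick consistency check is to slice the text with each span and compare the result with the stored string. The sketch below does this on a hand-built record that mirrors the fields listed above; `check_spans` is a hypothetical helper, and the text passed in should be whichever string was actually given to spaCy.

```python
from typing import Any, Dict, List


def check_spans(text: str, record: Dict[str, Any]) -> List[str]:
    """Return a list of messages for spans whose offsets do not reproduce their text."""
    problems = []
    for kind in ("sentences", "tokens"):
        for item in record.get(kind, []):
            if text[item["start"]:item["end"]] != item["text"]:
                problems.append(f"{kind}[{item['id']}]: offsets do not match text")
    return problems


# Hand-built record with the same field names as the pipeline output
record = {
    "text_original": "Rome is the capital of Italy.",
    "sentences": [{"id": 0, "start": 0, "end": 29, "text": "Rome is the capital of Italy."}],
    "tokens": [
        {"id": 0, "text": "Rome", "start": 0, "end": 4},
        {"id": 1, "text": "is", "start": 5, "end": 7},
    ],
}
problems = check_spans(record["text_original"], record)
print(problems or "all spans consistent")
```

A check like this is most useful when cleaning changes the text (for example when a URL is masked), since offsets computed on one version of the text will not line up with the other.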

In [None]:
# ---------------------------
# 3) Preprocess (fixed generator usage)
# ---------------------------
def preprocess_texts(
    texts_with_meta: List[Tuple[str, Dict[str, Any]]],
    do_clean: bool = True,
    do_langid: bool = True,
    do_sentences: bool = True,
    do_tokens: bool = True,
) -> List[Dict[str, Any]]:
    nlp = spacy.blank("xx")
    if do_sentences:
        nlp.add_pipe("sentencizer")

    results: List[Dict[str, Any]] = []

    if do_tokens or do_sentences:
        # Clean first, so that language detection and all sentence/token offsets refer to text_clean
        cleaned = [clean_text(t) if do_clean else t for t, _ in texts_with_meta]
        # Iterate with zip to consume the nlp.pipe generator safely
        for (text, meta), text_clean, doc in zip(texts_with_meta, cleaned, nlp.pipe(cleaned)):
            lang = detect_language(text_clean) if do_langid else {"language": "xx", "score": None}

            # Sentences
            sents = []
            if do_sentences:
                for sid, s in enumerate(doc.sents):
                    sents.append({"id": sid, "start": s.start_char, "end": s.end_char, "text": s.text})

            # Tokens
            tokens = []
            if do_tokens:
                tok2sent = {}
                if do_sentences:
                    for sid, s in enumerate(doc.sents):
                        for tok in s:
                            tok2sent[tok.i] = sid
                for tok in doc:
                    tokens.append({
                        "id": tok.i,
                        "text": tok.text,
                        "start": tok.idx,
                        "end": tok.idx + len(tok.text),
                        "ws": tok.whitespace_ != "",
                        "is_punct": tok.is_punct,
                        "sent_id": tok2sent.get(tok.i) if do_sentences else None
                    })

            results.append({
                "text_original": text,
                "text_clean": text_clean,
                "language": lang,
                "sentences": sents,
                "tokens": tokens,
                "meta": {
                    **meta,
                    "char_count": len(text),
                    "token_count": len(tokens),
                    "sentence_count": len(sents),
                }
            })
    else:
        # No tokenization/sentences requested
        for text, meta in texts_with_meta:
            text_clean = clean_text(text) if do_clean else text
            lang = detect_language(text) if do_langid else {"language": "xx", "score": None}
            results.append({
                "text_original": text,
                "text_clean": text_clean,
                "language": lang,
                "sentences": [],
                "tokens": [],
                "meta": {
                    **meta,
                    "char_count": len(text),
                    "token_count": 0,
                    "sentence_count": 0,
                }
            })

    return results

In [None]:
# call the preprocessing function and print the first results
pairs = [(r["text"], {"source": "CMA", "id": r["id"]}) for r in records]
out_api_id = preprocess_texts(pairs)
print(out_api_id[:10])

[{'text_original': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.", 'text_clean': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.", 'language': {'language': 'en', 'score': -868.9007034301758}, 'sentences': [{'id': 0, 'start': 0, 'end': 130, 'text': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil."}, {'id': 1, 'start': 131, 'end': 324, 'text': 'Her face is re
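
Many NER tools expect input as one list of tokens per sentence. As a sketch of how the record above could be reshaped for that, the hypothetical helper below groups token texts by their `sent_id`. It relies only on the `tokens` field shown in the output and is demonstrated on a small hand-built record.

```python
from collections import defaultdict
from typing import Any, Dict, List


def tokens_by_sentence(record: Dict[str, Any]) -> List[List[str]]:
    """Group token texts into one list per sentence, in sentence order."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for tok in record["tokens"]:
        sid = tok.get("sent_id")
        # Tokens without a sentence id (sentence splitting disabled) go into one shared group
        grouped[-1 if sid is None else sid].append(tok["text"])
    return [grouped[sid] for sid in sorted(grouped)]


# Hand-built record using the same token fields as the pipeline output
demo = {
    "tokens": [
        {"id": 0, "text": "Rome", "sent_id": 0},
        {"id": 1, "text": "is", "sent_id": 0},
        {"id": 2, "text": "old", "sent_id": 0},
        {"id": 3, "text": ".", "sent_id": 0},
        {"id": 4, "text": "Visit", "sent_id": 1},
        {"id": 5, "text": "it", "sent_id": 1},
        {"id": 6, "text": ".", "sent_id": 1},
    ]
}
sentences = tokens_by_sentence(demo)
print(sentences)
```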

# Step 5: Put it all together

Call the functions. The cells below provide four options (A to D): two for API calls (by ID and by keyword search) and two for local input (inline examples and an uploaded .txt file).

Each option can be run independently and saves its output to a JSON file in the Colab session's `/content` folder.
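
Every option writes its results with the same `open` / `json.dump` pattern. A small helper, sketched below under the hypothetical name `save_records`, would keep that in one place and make the path configurable for use outside Colab. Passing `ensure_ascii=False` keeps characters such as "ÿ" readable in the file instead of writing them as `\u` escapes.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


def save_records(records: List[Dict[str, Any]], path: Union[str, Path]) -> Path:
    """Write records to a UTF-8 JSON file, creating parent folders if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return out


# Demo in a temporary folder; in Colab the path could be "/content/output_api_id.json"
demo_dir = tempfile.mkdtemp()
saved = save_records([{"text_clean": "Coraÿ and Groskurd"}], Path(demo_dir) / "output_demo.json")
print(saved.read_text(encoding="utf-8"))
```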

In [None]:
# ---------------------------
# 4) MAIN (robust demos)
# ---------------------------
if __name__ == "__main__":
    # OPTION A: CMA by id (handles 404 gracefully)
    records = fetch_cma("135382", mode="id")
    if not records:
        print("CMA by id: no record (likely 404).")
    else:
        pairs = [(r["text"], {"source": "CMA", "id": r["id"]}) for r in records]
        out_api_id = preprocess_texts(pairs)
        print("CMA by id:")
        print(json.dumps(out_api_id, ensure_ascii=False, indent=2))

        # save the results in a local file
        output_file = "/content/output_api_id.json"
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(out_api_id, f)
            print(f"Saved preprocessed files to {output_file}")

CMA by id:
[
  {
    "text_original": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
    "text_clean": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
    "language": {
      "language": "en",
      "score": -868.9007034301758
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 130,
        "text": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of

In [None]:
out_api_id

[{'text_original': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
  'text_clean': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
  'language': {'language': 'en', 'score': -868.9007034301758},
  'sentences': [{'id': 0,
    'start': 0,
    'end': 130,
    'text': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil."},
   {'id': 1,
    'start': 131,
    'end':

In [None]:
# OPTION B: CMA search (take a couple of hits, skip empty descriptions)
if __name__ == "__main__":
    try:
        records = fetch_cma("monet", mode="search", limit=5)
        pairs = [(r["text"], {"source": "CMA", "id": r["id"]}) for r in records if r["text"]]
        pairs = pairs[:2]  # first two with non-empty text
        if pairs:
            out_api_search = preprocess_texts(pairs)
            print("\nCMA search:")
            print(json.dumps(out_api_search, ensure_ascii=False, indent=2))
            # save the results in a local file
            output_file = "/content/output_api_search.json"
            with open(output_file, "w", encoding="utf-8") as f:
                json.dump(out_api_search, f)
            print(f"Saved preprocessed files to {output_file}")
        else:
            print("\nCMA search: no descriptions returned for this query.")
    except Exception as e:
        print("API search failed:", e)


CMA search:
[
  {
    "text_original": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
    "text_clean": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
    "language": {
      "language": "en",
      "score": -868.9007034301758
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 130,
        "text": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors 

In [None]:
# OPTION C: Local examples
if __name__ == "__main__":
    examples = [
        ("Rome is the capital of Italy.", {"source": "local"}),
        ("Mark Rutte bezocht gisteren Groningen.", {"source": "local"}),
    ]
    out_local = preprocess_texts(examples)
    print("\nLocal examples:")
    print(json.dumps(out_local, ensure_ascii=False, indent=2))

    # save the results in a local file
    output_file = "/content/output_local.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(out_local, f)
    print(f"Results saved in {output_file}")


Local examples:
[
  {
    "text_original": "Rome is the capital of Italy.",
    "text_clean": "Rome is the capital of Italy.",
    "language": {
      "language": "en",
      "score": -73.82194948196411
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 29,
        "text": "Rome is the capital of Italy."
      }
    ],
    "tokens": [
      {
        "id": 0,
        "text": "Rome",
        "start": 0,
        "end": 4,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 1,
        "text": "is",
        "start": 5,
        "end": 7,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 2,
        "text": "the",
        "start": 8,
        "end": 11,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 3,
        "text": "capital",
        "start": 12,
        "end": 19,
        "ws": true,
        "is_punct": false

In [None]:
# upload a file from your computer into the /content folder of the Colab notebook.
# this step will enable you to run option D with your own .txt file.

from google.colab import files
uploaded = files.upload()


Saving input.txt to input (1).txt
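
The message above shows that Colab renamed the upload to `input (1).txt` because a file called `input.txt` already existed, so a hard-coded path may read an older copy. One way around this is to take the text from the dictionary returned by `files.upload()`, which maps each saved filename to the file's bytes. The sketch below uses a stand-in dictionary so it runs outside Colab; `pairs_from_upload` and the `filename` metadata key are hypothetical, and UTF-8 encoding is an assumption about the uploaded file.

```python
from typing import Any, Dict, List, Tuple


def pairs_from_upload(uploaded: Dict[str, bytes]) -> List[Tuple[str, Dict[str, Any]]]:
    """Turn an upload dict (filename -> bytes) into (text, meta) pairs for preprocessing."""
    pairs = []
    for name, raw in uploaded.items():
        # Assumes UTF-8; undecodable bytes are replaced rather than raising an error
        text = raw.decode("utf-8", errors="replace")
        pairs.append((text, {"source": "local", "filename": name}))
    return pairs


# Stand-in for the value returned by files.upload() in Colab
uploaded_demo = {"input (1).txt": "STRABO, the author of this work".encode("utf-8")}
upload_pairs = pairs_from_upload(uploaded_demo)
print(upload_pairs)
```

The resulting list has the same shape as the `examples` list in the options, so it could be passed to `preprocess_texts` directly.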


In [None]:
# OPTION D: Local text from a .txt file (upload the file to the /content folder of the Colab notebook first)
if __name__ == "__main__":
    # open a local .txt file as input. remember to change the path and filename to your file.
    with open("/content/input.txt", "r", encoding="utf-8") as f:
        text = f.read()

    # then process the text of the input file
    examples = [(text, {"source": "local"})]
    out_local = preprocess_texts(examples)
    print("\nLocal examples:")
    print(json.dumps(out_local, ensure_ascii=False, indent=2))

    # save the results in a local file
    output_file = "/content/output_file.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(out_local, f)
    print(f"Results saved in {output_file}")


Local examples:
[
  {
    "text_original": "STRABO, the author of this work, was born at Amasia, or Amasijas, a town situated in the gorge of the mountains through which passes the river Iris, now the Ieschil Irmak, in Pontus, which he has described in the 12th book.[*] He lived during the reign of Augustus, and the earlier part of the reign of Tiberius; for in the 13th book[*] he relates how Sardes and other cities, which had suffered severely from earthquakes, had been repaired by the provident care of Tiberius the present Emperor; but the exact date of his birth, as also of his death, are subjects of conjecture only. Coraÿ and Groskurd conclude, though by a somewhat different argument, that he was born in the year B. C. 66, and the latter that he died A. D. 24. The date of his birth as argued by Groskurd, proceeds on the assumption that Strabo was in his thirty-eighth year when he went from Gyaros to Corinth, at which latter place Octavianus Caesar was then staying on his return to

## Save the results in a local .zip file

The output files are stored only in the temporary storage of the Google Colab session and are deleted when the session ends.

Two options are available:
1. If you are only interested in one of the files you generated, you can simply download the individual output file.
2. If you ran all options and want to save all output files, you can download all of them as a zip folder.
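
If only some options were run, some of the output files will not exist. As a sketch of the second option that tolerates this, the hypothetical helper below uses Python's standard `zipfile` module and adds only the files that are present. The demo runs in a temporary folder; in Colab the paths would point to `/content`.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Iterable, List


def zip_existing(paths: Iterable[str], archive: str) -> List[str]:
    """Add every existing file in `paths` to `archive`; return the names that were added."""
    added = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in map(Path, paths):
            if p.is_file():
                zf.write(p, arcname=p.name)  # store without the folder prefix
                added.append(p.name)
    return added


# Demo: one output file exists, the other was never generated
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "output_local.json").write_text("[]", encoding="utf-8")
archive_path = str(demo_dir / "output_files.zip")
added = zip_existing(
    [str(demo_dir / "output_local.json"), str(demo_dir / "output_api_id.json")],
    archive_path,
)
print(added)
```

The resulting archive could then be passed to `files.download` in the same way as a single JSON file.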


In [None]:
# take the output_file generated by your chosen option and download it on your machine.
from google.colab import files
files.download("/content/output_file.json")


In [None]:
# bundle the output files saved under /content into a single .zip file and download it locally
from google.colab import files
!zip -r /content/output_files.zip output_api_id.json output_api_search.json output_local.json output_file.json
files.download("/content/output_files.zip")

  adding: output_local.json (deflated 86%)



# Full Code (previous version)

In [None]:
# pip install requests spacy langid

import re, unicodedata, json
from typing import List, Dict, Any, Tuple

import requests
import spacy
import langid

# ---------------------------
# 1) CMA API (single function)
# ---------------------------
def fetch_cma(source: str, *, mode: str = "search", limit: int = 100) -> List[Dict[str, Any]]:
    """
    mode='id'     -> source is an artwork id
    mode='search' -> source is a keyword query
    Returns: list of {'id': <id>, 'text': <description or ''>}
    """
    base = "https://openaccess-api.clevelandart.org/api/artworks"

    if mode == "id":
        url = f"{base}/{source}"
        r = requests.get(url, timeout=20)
        if r.status_code == 404:
            # Graceful return instead of raising
            return []
        r.raise_for_status()
        data = r.json().get("data", {})
        return [{"id": data.get("id"), "text": (data.get("description") or "").strip()}]

    # search mode
    params = {"q": source, "skip": 0, "limit": limit}
    r = requests.get(base, params=params, timeout=30)
    r.raise_for_status()
    out = []
    for a in r.json().get("data", []):
        out.append({"id": a.get("id"), "text": (a.get("description") or "").strip()})
    return out

# ---------------------------
# 2) Pre-processing helpers
# ---------------------------
def clean_text(text: str) -> str:
    t = unicodedata.normalize("NFC", text)
    t = re.sub(r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]", "", t)  # strip control chars
    t = re.sub(r"[ \t\u00A0]+", " ", t)                                   # collapse spaces
    t = re.sub(r"https?://\S+", "<URL>", t)                               # mask URLs
    t = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<EMAIL>", t)               # mask emails
    return t

def detect_language(text: str) -> Dict[str, Any]:
    if not text:
        return {"language": "xx", "score": None}
    code, score = langid.classify(text)  # score = log-likelihood
    return {"language": code, "score": float(score)}

# ---------------------------
# 3) Preprocess (fixed generator usage)
# ---------------------------
def preprocess_texts(
    texts_with_meta: List[Tuple[str, Dict[str, Any]]],
    do_clean: bool = True,
    do_langid: bool = True,
    do_sentences: bool = True,
    do_tokens: bool = True,
) -> List[Dict[str, Any]]:
    nlp = spacy.blank("xx")
    if do_sentences:
        nlp.add_pipe("sentencizer")

    results: List[Dict[str, Any]] = []

    if do_tokens or do_sentences:
        # Iterate with zip to consume the generator safely
        for (text, meta), doc in zip(texts_with_meta, nlp.pipe([t for t, _ in texts_with_meta])):
            text_clean = clean_text(text) if do_clean else text
            lang = detect_language(text) if do_langid else {"language": "xx", "score": None}

            # Sentences
            sents = []
            if do_sentences:
                for sid, s in enumerate(doc.sents):
                    sents.append({"id": sid, "start": s.start_char, "end": s.end_char, "text": s.text})

            # Tokens
            tokens = []
            if do_tokens:
                tok2sent = {}
                if do_sentences:
                    for sid, s in enumerate(doc.sents):
                        for tok in s:
                            tok2sent[tok.i] = sid
                for tok in doc:
                    tokens.append({
                        "id": tok.i,
                        "text": tok.text,
                        "start": tok.idx,
                        "end": tok.idx + len(tok.text),
                        "ws": tok.whitespace_ != "",
                        "is_punct": tok.is_punct,
                        "sent_id": tok2sent.get(tok.i) if do_sentences else None
                    })

            results.append({
                "text_original": text,
                "text_clean": text_clean,
                "language": lang,
                "sentences": sents,
                "tokens": tokens,
                "meta": {
                    **meta,
                    "char_count": len(text),
                    "token_count": len(tokens),
                    "sentence_count": len(sents),
                }
            })
    else:
        # No tokenization/sentences requested
        for text, meta in texts_with_meta:
            text_clean = clean_text(text) if do_clean else text
            lang = detect_language(text) if do_langid else {"language": "xx", "score": None}
            results.append({
                "text_original": text,
                "text_clean": text_clean,
                "language": lang,
                "sentences": [],
                "tokens": [],
                "meta": {
                    **meta,
                    "char_count": len(text),
                    "token_count": 0,
                    "sentence_count": 0,
                }
            })

    return results

# ---------------------------
# 4) MAIN (robust demos)
# ---------------------------
if __name__ == "__main__":
    # OPTION A: CMA by id (handles 404 gracefully)
    records = fetch_cma("190889", mode="id")  # example id; may not exist
    if not records:
        print("CMA by id: no record (likely 404).")
    else:
        pairs = [(r["text"], {"source": "CMA", "id": r["id"]}) for r in records]
        out_api_id = preprocess_texts(pairs)
        print("CMA by id:")
        print(json.dumps(out_api_id, ensure_ascii=False, indent=2))

    # OPTION B: CMA search (take a couple of hits, skip empty descriptions)
    try:
        records = fetch_cma("Raphael", mode="search", limit=5)
        pairs = [(r["text"], {"source": "CMA", "id": r["id"]}) for r in records if r["text"]]
        pairs = pairs[:2]  # first two with non-empty text
        if pairs:
            out_api_search = preprocess_texts(pairs)
            print("\nCMA search:")
            print(json.dumps(out_api_search, ensure_ascii=False, indent=2))
        else:
            print("\nCMA search: no descriptions returned for this query.")
    except Exception as e:
        print("API search failed:", e)

    # OPTION C: Local examples
    examples = [
        ("Barack Obama met Angela Merkel in Berlin.", {"source": "local"}),
        ("Mark Rutte bezocht gisteren Groningen.", {"source": "local"})
    ]
    out_local = preprocess_texts(examples)
    print("\nLocal examples:")
    print(json.dumps(out_local, ensure_ascii=False, indent=2))

CMA by id: no record (likely 404).

CMA search:
[
  {
    "text_original": "West was the first American artist to study in Italy, where he spent three years before permanently settling in London. He so admired the artistic ideals of the Italian Renaissance master Raphael that he named his eldest son after him, and he imitated Raphael’s celebrated Madonna of the Chair when composing this tender double portrait of his wife and child.",
    "text_clean": "West was the first American artist to study in Italy, where he spent three years before permanently settling in London. He so admired the artistic ideals of the Italian Renaissance master Raphael that he named his eldest son after him, and he imitated Raphael’s celebrated Madonna of the Chair when composing this tender double portrait of his wife and child.",
    "language": {
      "language": "en",
      "score": -1005.8172142505646
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 119,
        "text