<a href="https://colab.research.google.com/github/pelagios/llm-lod-enriching-heritage/blob/main/notebooks/data_preparation/data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prepare Cultural Heritage Data for Named Entity Analysis

This notebook prepares data for named enity analysis. It performs four tasks which are represented by the next four chapters:

1. Install required software libraries
2. Download text from museum website
3. Preprocess text for named entity analysis
4. Save results

The fifth chapter provides alternatives for handling texts from other sources than websites:

5. Alternatives for reading text data

## 1. Install required software libraries

Preprocessing data requires importing some standard software libraries. This step may take some time when run for the first time but in successive runs it will be a lot faster.

In [22]:
import importlib
import subprocess

def safe_import(package_name):
    """Import a package;. If it missing, download it first"""
    try:
        return importlib.import_module(package_name)
    except ImportError:
        print(f"📦 {package_name} not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        return importlib.import_module(package_name)

spacy = safe_import("spacy")
langid = safe_import("langid")

In [39]:
import regex, unicodedata, json
from typing import List, Dict, Any, Tuple, Optional
import hashlib
import polars as pl

import requests

try:
    from google.colab import files
    IN_COLAB = True
except:
    IN_COLAB = False

## 2. Download text from museum website

We use the description of a painting by Claude Monet from the artwork Cleveland Museum website as example text

In [21]:
base_url = "https://openaccess-api.clevelandart.org/api/artworks"
data_source = "CMA"

def fetch_cma(query_string: str) -> List[Dict[str, Any]]:
    """Fetch metadata from Cleveland Museum of Art website"""
    try:
        response = requests.get(base_url, params={"q": query_string, "skip": 0, "limit": 100}, timeout=10)
    except:
        return []
    response.raise_for_status()
    artworks = response.json().get("data", [])
    if artworks == []:
        return []
    else:
        return [{"id": artworks[0].get("id"), 
                 "data_source": data_source,
                 "text_original": (artworks[0].get("description") or "").strip()}]

In [4]:
texts = fetch_cma("monet")
if not texts:
    print("No texts were found. Are you connected to the internet?")
else:
    print("Text found:", texts[0])

Text found: {'id': 135382, 'data_source': 'CMA', 'text_original': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green."}


## 3. Preprocess text for named entity analysis

Three steps are performed while preprocessing the texts:

1. Text cleanup: remove non-text characters, urls and email addresses
2. Detect the language of the text
3. Split the text in sentences and tokens

In [68]:
def cleanup_text(text: str) -> str:
    """Cleanup text: remove non-text characters, urls and email addresses"""
    text = unicodedata.normalize("NFC", text)
    text = regex.sub(r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]", "", text)
    text = regex.sub(r"[ \t\u00A0]+", " ", text)
    text = regex.sub("\s+", " ", text)
    text = regex.sub(r"https?://\S+", "<URL>", text)
    text = regex.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<EMAIL>", text)
    return text

In [23]:
def detect_text_language(text: str) -> Dict[str, Any]:
    """Detect language text is written in and return id of the language"""
    return langid.classify(text)[0]

In [24]:
def preprocess_text(text, language_id):
    """Preprocess a single text: divide it in sentences and tokens"""
    try:
        spacy_model = spacy.blank(language_id)
    except:
        print(f"Cannot load model for language {language_id}")
        spacy_model = spacy.blank("xx")
    spacy_model.add_pipe("sentencizer")
    preprocessed_text = spacy_model(text)
    sentences = [{"id": sentence_id, 
                  "start": sentence.start_char, 
                  "end": sentence.end_char, 
                  "text": sentence.text} for sentence_id, sentence in enumerate(preprocessed_text.sents)]
    tok2sent = {token.i: sentence_id for sentence_id, sentence in enumerate(preprocessed_text.sents) 
                                     for token in sentence}
    tokens = [{"id": token.i,
               "text": token.text,
               "start": token.idx,
               "end": token.idx + len(token.text),
               "ws": token.whitespace_ != "",
               "is_punct": token.is_punct,
               "sent_id": tok2sent.get(token.i)} for token in preprocessed_text]
    return sentences, tokens

In [25]:
def preprocess_texts(texts) -> List[Dict[str, Any]]:
    """Preprocess a list of texts and return the results as a list of dictionaries"""
    results: List[Dict[str, Any]] = []
    for text in texts:
        sentences, tokens = preprocess_text(text["text_cleaned"], text["language_id"])
        results.append({"meta": {**{key: text[key] for key in text if not regex.search("text", key)},
                                 "char_count": len(text),
                                 "token_count": len(tokens),
                                 "sentence_count": len(sentences)},
                        "text_original": text["text_original"],
                        "text_cleaned": text["text_cleaned"],
                        "sentences": sentences,
                        "tokens": tokens})
    return results

In [87]:
texts_cleaned = [text | {"text_cleaned": cleanup_text(text["text_original"])} for text in texts]
print({key: texts_cleaned[0][key] 
       for key in texts_cleaned[0]
       if key not in ["text_original"]})

{'id': 1, 'data_source': 'wikipedia', 'text_cleaned': 'Der Kölner Dom (offiziell Hohe Domkirche zu Köln) ist eine römisch-katholische Kirche in Köln unter dem Patrozinium des Apostels Petrus. Er ist die Kathedrale des Erzbistums Köln sowie Metropolitan\xadkirche der Kirchenprovinz Köln. Hausherr ist der Dompropst. Der Kölner Dom ist eine der größten Kathedralen im gotischen Baustil. Sein Bau wurde 1248 im Auftrag von Konrad I. nach Entwurf von Meister Gerhard begonnen und 1880 im Auftrag von Friedrich Wilhelm IV. nach Entwurf von Ernst Friedrich Zwirner vollendet. Einige Kunsthistoriker haben den Dom wegen seiner einheitlichen und ausgewogenen Bauform als „vollkommene Kathedrale“ bezeichnet. Mit 157,22 Metern ist er nach dem Ulmer Münster der zweithöchste Sakralbau Deutschlands und hinter der Basilika Notre-Dame-de-la-Paix de Yamoussoukro die dritthöchste Kirche der Welt.'}


In [88]:
texts_with_language_ids = [{"language_id": detect_text_language(text["text_cleaned"])} | text
                            for text in texts_cleaned]
print({key: texts_with_language_ids[0][key] 
       for key in texts_with_language_ids[0]
       if key not in ["text_original"]})

{'language_id': 'de', 'id': 1, 'data_source': 'wikipedia', 'text_cleaned': 'Der Kölner Dom (offiziell Hohe Domkirche zu Köln) ist eine römisch-katholische Kirche in Köln unter dem Patrozinium des Apostels Petrus. Er ist die Kathedrale des Erzbistums Köln sowie Metropolitan\xadkirche der Kirchenprovinz Köln. Hausherr ist der Dompropst. Der Kölner Dom ist eine der größten Kathedralen im gotischen Baustil. Sein Bau wurde 1248 im Auftrag von Konrad I. nach Entwurf von Meister Gerhard begonnen und 1880 im Auftrag von Friedrich Wilhelm IV. nach Entwurf von Ernst Friedrich Zwirner vollendet. Einige Kunsthistoriker haben den Dom wegen seiner einheitlichen und ausgewogenen Bauform als „vollkommene Kathedrale“ bezeichnet. Mit 157,22 Metern ist er nach dem Ulmer Münster der zweithöchste Sakralbau Deutschlands und hinter der Basilika Notre-Dame-de-la-Paix de Yamoussoukro die dritthöchste Kirche der Welt.'}


In [89]:
preprocessed_texts = preprocess_texts(texts_with_language_ids)
print({key: preprocessed_texts[0][key] 
       for key in preprocessed_texts[0] 
       if key not in ["text_original", "text_cleaned", "tokens"]} | 
      {"tokens": preprocessed_texts[0]["tokens"][:3] + ["..."]})

{'meta': {'language_id': 'de', 'id': 1, 'data_source': 'wikipedia', 'char_count': 5, 'token_count': 128, 'sentence_count': 7}, 'sentences': [{'id': 0, 'start': 0, 'end': 136, 'text': 'Der Kölner Dom (offiziell Hohe Domkirche zu Köln) ist eine römisch-katholische Kirche in Köln unter dem Patrozinium des Apostels Petrus.'}, {'id': 1, 'start': 137, 'end': 229, 'text': 'Er ist die Kathedrale des Erzbistums Köln sowie Metropolitan\xadkirche der Kirchenprovinz Köln.'}, {'id': 2, 'start': 230, 'end': 257, 'text': 'Hausherr ist der Dompropst.'}, {'id': 3, 'start': 258, 'end': 327, 'text': 'Der Kölner Dom ist eine der größten Kathedralen im gotischen Baustil.'}, {'id': 4, 'start': 328, 'end': 512, 'text': 'Sein Bau wurde 1248 im Auftrag von Konrad I. nach Entwurf von Meister Gerhard begonnen und 1880 im Auftrag von Friedrich Wilhelm IV. nach Entwurf von Ernst Friedrich Zwirner vollendet.'}, {'id': 5, 'start': 513, 'end': 642, 'text': 'Einige Kunsthistoriker haben den Dom wegen seiner einheitlic

## 4. Save results

In [72]:
def save_results(texts):
    """Save preprocessed texts in a json file"""
    json_string = json.dumps(texts, ensure_ascii=False, indent=2)
    hash = hashlib.sha1(json_string.encode("utf-8")).hexdigest()
    output_file_name = f"output_{hash}.json"
    with open(output_file_name, "w", encoding="utf-8") as output_file:
        print(json_string, end="", file=output_file)
        output_file.close()
        print(f"Saved preprocessed texts to {output_file_name}")

In [90]:
save_results(preprocessed_texts)

Saved preprocessed texts to output_38ecb18ac5f2e220a8400132dfc59b8a20b34ef6.json


## 5. Alternatives for reading text data

Here are some methods for reading texts as alternatives for reading them from a museum website as presented in chapter 2 of this notebook. Run these code blocks instead of the code blocks of chapter 2 and then proceed with chapter 3.

### 5.1. Small text examples defined in the code

We use descriptions from three famous artworks from Wikipedia, written in other languages than English.

In [62]:
texts = [{"id": 1, "data_source": "wikipedia", "text_original": """De Nachtwacht is een schuttersstuk van de 
           Hollandse schilder Rembrandt van Rijn (1606-1669) dat in 1642 gereed kwam. De huidige officiële titel 
           luidt: Officieren en andere schutters van wijk II in Amsterdam, onder leiding van kapitein Frans 
           Banninck Cocq en luitenant Willem van Ruytenburch, bekend als ‘De Nachtwacht’."""},
         {"id": 2, "data_source": "wikipedia", "text_original": """La Gioconda, nota anche come Monna Lisa, è un 
           dipinto a olio su tavola di pioppo realizzato da Leonardo da Vinci (77 × 53 cm e 13 mm di spessore), 
           databile al 1503-1506 circa e conservato nel Museo del Louvre di Parigi col numero 779 di catalogo."""},
         {"id": 3, "data_source": "wikipedia", "text_original": """Le Penseur (initialement intitulé Le Poète) est 
           un des chefs-d'œuvre emblématiques d'Auguste Rodin."""}]

### 5.2. Reading texts from a csv file

We use the artefact descriptions from the Egyptian Museum in Turin as example texts. We use two columns of the file, one with the identifier of the artefact and one with the description text

Change the name of the file to your own file if you need to process other data. Do not forget to change the values of the variables `data_source`, `id_column_name` and `text_column_name` as well.

In [82]:
file_name = "em.csv"
data_source = "EMT"
id_column_name = "Inventory Number"
text_column_name = "Description"


def read_emt_data(file_name):
    """Read texts from the Egyptian Museum in Turin from a csv file"""
    try:
        table_pl = pl.read_csv(file_name)[id_column_name, text_column_name]
        return [{"id": row[0], "data_source": data_source, "text_original": row[1]}
                for row in table_pl.iter_rows()]
    except:
        print(f"Cannot read data from file {file_name}!")
        return []

In [75]:
texts = read_emt_data(file_name)
if len(texts) <= 0:
    print(f"No texts found in file {file_name}!")
else:
    print("Text found:", texts[0])

Text found: {'id': 'C. 0115', 'data_source': 'EMT', 'text_original': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115'}


### 5.3. Read a single text from a text file

We use a description of a monument from Wikipedia, written in a different language than English

In [85]:
file_name = "wikipedia.txt"
data_source = "wikipedia"


def read_wikipedia_file(file_name):
    """Read a text from Wikipedia from a text file"""
    try:
        with open(file_name, "r") as infile:
            text = infile.read().strip()
            infile.close()
        return [{"id": 1, "data_source": data_source, "text_original": text}]
    except:
        print(f"Cannot read data from file {file_name}!")
        return []

In [86]:
texts = read_wikipedia_file(file_name)
if len(texts) <= 0:
    print(f"No texts found in file {file_name}!")
else:
    print("Text found:", texts[0])

Text found: {'id': 1, 'data_source': 'wikipedia', 'text_original': 'Der Kölner Dom (offiziell Hohe Domkirche zu Köln) ist eine römisch-katholische Kirche in Köln unter dem Patrozinium des Apostels Petrus. Er ist die Kathedrale des Erzbistums Köln sowie Metropolitan\xadkirche der Kirchenprovinz Köln. Hausherr ist der Dompropst. Der Kölner Dom ist eine der größten Kathedralen im gotischen Baustil. Sein Bau wurde 1248 im Auftrag von Konrad I. nach Entwurf von Meister Gerhard begonnen und 1880 im Auftrag von Friedrich Wilhelm IV. nach Entwurf von Ernst Friedrich Zwirner vollendet. Einige Kunsthistoriker haben den Dom wegen seiner einheitlichen und ausgewogenen Bauform als „vollkommene Kathedrale“ bezeichnet. Mit 157,22 Metern ist er nach dem Ulmer Münster der zweithöchste Sakralbau Deutschlands und hinter der Basilika Notre-Dame-de-la-Paix de Yamoussoukro die dritthöchste Kirche der Welt.'}


## Old code

### Save the results in a local .zip file

The output files are only saved locally in the Google Colab notebook, and will be deleted after the notebook is closed.

Two options are available:
1. If you are only interested in one of the files you generated, you can simply download the individual output file.
2. If you ran all options and want to save all output files, you can download all of them as a zip folder.


In [None]:
# take the output_file generated by your chosen option and download it on your machine.
from google.colab import files
files.download("output_file.json")

In [None]:
# take the output_file saved under content and download it locally as a .zip file
from google.colab import files
!zip -r output_files.zip output_api_id.json output_api_search.json output_local.json output_file.json
files.download("output_files.zip")