<a href="https://colab.research.google.com/github/pelagios/llm-lod-enriching-heritage/blob/main/notebooks/data_preparation/data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prepare Cultural Heritage Data for Named Entity Analysis

This notebook prepares data for named entity analysis. It performs four tasks which are represented by the next four chapters:

1. Install required software libraries
2. Download text from museum website
3. Preprocess text for named entity analysis
4. Save results

The fifth chapter provides alternatives for handling texts from other sources than websites:

5. Alternatives for reading text data

## 1. Install required software libraries

Preprocessing data requires importing some standard software libraries. This step may take some time when run for the first time but in successive runs it will be a lot faster.

First we import standard libraries which should always be available

In [1]:
import regex
import hashlib
import importlib
import json
import regex
import requests
import subprocess
import sys
from typing import List, Dict, Any, Tuple, Optional
import unicodedata

Next we import packages which may require installation on this device

In [2]:
char_package = "📦"
char_success = "✅"
char_failiure = "❌"


def safe_import(package_name):
    """Import a package;. If it missing, download it first"""
    try:
        return importlib.import_module(package_name)
    except ImportError:
        print(f"{char_package} {package_name} not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"Finished installing {package_name}")
        return importlib.import_module(package_name)


spacy = safe_import("spacy")
langid = safe_import("langid")
regex = safe_import("regex")
pl = safe_import("polars")

Finally we set setting required for Google Colab

In [3]:
from IPython.display import HTML, display


def set_css():
    """Fix line wrapping of output texts for Google Colab"""
    display(HTML("<style> pre { white-space: pre-wrap; </style>"))


try:
    from google.colab import files
    IN_COLAB = True
except:
    IN_COLAB = False

if IN_COLAB:
    get_ipython().events.register('pre_run_cell', set_css)

## 2. Reading texts from a csv file

We use the artefact descriptions from the Egyptian Museum in Turin as example texts. We use two columns of the file, one with the identifier of the artefact and one with the description text

First we define a function for reading the texts

In [4]:
file_name = "em.csv"
data_source = "EMT"
id_column_name = "Inventory Number"
text_column_name = "Description"


def read_emt_data(file_name):
    """Read texts from the Egyptian Museum in Turin from a csv file"""
    try:
        table_pl = pl.read_csv(file_name)[id_column_name, text_column_name]
        table_pl.write_csv("tmp.csv")
        return [{"id": row[0], "data_source": data_source, "text_original": row[1]}
                for row in table_pl.iter_rows()]
    except:
        print(f"{char_failiure} Cannot read data from file {file_name}!")
        return []

Next we call the function and store the variable in the variable `texts`. We show the first text to check if the process was successful. Note that the text includes (`id`) the text identifier as metadata

In [5]:
texts = read_emt_data(file_name)
if len(texts) <= 0:
    print(f"{char_failiure} No texts found in file {file_name}!")
else:
    print("Text found:", texts[0])

Text found: {'id': 'C. 0115', 'data_source': 'EMT', 'text_original': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115'}


## 3. Preprocess text for named entity analysis

Three steps are performed while preprocessing the texts:

1. Text cleanup: remove non-text characters, urls and email addresses
2. Detect the language of the text
3. Split the text in sentences and tokens

We start with defining five functions for performing the preprocessing tasks

In [6]:
def cleanup_text(text: str) -> str:
    """Cleanup text: remove non-text characters, urls and email addresses"""
    text = unicodedata.normalize("NFC", text)
    text = regex.sub(r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]", "", text)
    text = regex.sub(r"[ \t\u00A0]+", " ", text)
    text = regex.sub(r"\s+", " ", text)
    text = regex.sub(r"https?://\S+", "<URL>", text)
    text = regex.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<EMAIL>", text)
    return text

In [7]:
def detect_text_language(text: str) -> Dict[str, Any]:
    """Detect language text is written in and return id of the language"""
    return langid.classify(text)[0]

In [8]:
def preprocess_text(text, language_id):
    """Preprocess a single text: divide it in sentences and tokens"""
    try:
        spacy_model = spacy.blank(language_id)
    except:
        print(f"{char_failiure} Cannot load model for language {language_id}")
        spacy_model = spacy.blank("xx")
    spacy_model.add_pipe("sentencizer")
    preprocessed_text = spacy_model(text)
    sentences = [{"id": sentence_id, 
                  "start": sentence.start_char, 
                  "end": sentence.end_char, 
                  "text": sentence.text} for sentence_id, sentence in enumerate(preprocessed_text.sents)]
    tok2sent = {token.i: sentence_id for sentence_id, sentence in enumerate(preprocessed_text.sents) 
                                     for token in sentence}
    tokens = [{"id": token.i,
               "text": token.text,
               "start": token.idx,
               "end": token.idx + len(token.text),
               "ws": token.whitespace_ != "",
               "is_punct": token.is_punct,
               "sent_id": tok2sent.get(token.i)} for token in preprocessed_text]
    return sentences, tokens

In [9]:
def preprocess_texts(texts) -> List[Dict[str, Any]]:
    """Preprocess a list of texts and return the results as a list of dictionaries"""
    results: List[Dict[str, Any]] = []
    for text in texts:
        sentences, tokens = preprocess_text(text["text_cleaned"], text["language_id"])
        results.append({"meta": {**{key: text[key] for key in text if not regex.search("text", key)},
                                 "char_count": len(text),
                                 "token_count": len(tokens),
                                 "sentence_count": len(sentences)},
                        "text_original": text["text_original"],
                        "text_cleaned": text["text_cleaned"],
                        "sentences": sentences,
                        "tokens": tokens})
    return results

In [10]:
def show_example_text(text, skipped_fields=[]):
    """Show example text"""
    text_shown = {key: text[key] for key in text if key not in skipped_fields}
    if "tokens" in skipped_fields:
        text_shown = text_shown | {"tokens": text["tokens"][:3] + 
                                             ["..."] if len(text["tokens"]) >= 3 else []}
    print(text_shown)

Next, we apply the cleanup function to the texts and store the results in the variable `text_cleaned`. We show the first text to check if the process was successful

In [11]:
texts_cleaned = [text | {"text_cleaned": cleanup_text(text["text_original"])} for text in texts]
show_example_text(texts_cleaned[0], skipped_fields=["text_original"])

{'id': 'C. 0115', 'data_source': 'EMT', 'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115'}


After this, we apply the language derivation function to the texts and store the results in the variable `text_with_language_ids`. Again, we show the first text to check if the process was successful

In [12]:
texts_with_language_ids = [{"language_id": detect_text_language(text["text_cleaned"])} | text
                            for text in texts_cleaned]
show_example_text(texts_with_language_ids[0], skipped_fields=["text_original"])

{'language_id': 'en', 'id': 'C. 0115', 'data_source': 'EMT', 'text_cleaned': 'Statuette of the god Anubis. Bronze. Late Period (722-332 BC).. Acquired before 1882. C. 115'}


Finally, we apply the preprocess function to the texts and store the results in the variable `text_preprocessed`. We show the first text to check if the process was successful. The texts have been divided in sentences and tokens

In [13]:
texts_preprocessed = preprocess_texts(texts_with_language_ids)
show_example_text(texts_preprocessed[0], skipped_fields=["text_original", "text_cleaned", "tokens"])

{'meta': {'language_id': 'en', 'id': 'C. 0115', 'data_source': 'EMT', 'char_count': 5, 'token_count': 23, 'sentence_count': 4}, 'sentences': [{'id': 0, 'start': 0, 'end': 28, 'text': 'Statuette of the god Anubis.'}, {'id': 1, 'start': 29, 'end': 36, 'text': 'Bronze.'}, {'id': 2, 'start': 37, 'end': 85, 'text': 'Late Period (722-332 BC).. Acquired before 1882.'}, {'id': 3, 'start': 86, 'end': 92, 'text': 'C. 115'}], 'tokens': [{'id': 0, 'text': 'Statuette', 'start': 0, 'end': 9, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 1, 'text': 'of', 'start': 10, 'end': 12, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 2, 'text': 'the', 'start': 13, 'end': 16, 'ws': True, 'is_punct': False, 'sent_id': 0}, '...']}


## 4. Save results

When running the notebook locally, the preprocessed texts can be saved locally. When running the notebook on Google Colab, the results need to be downloaded because saved files on Google Colab will be removed automatically.

First we define a function for saving the texts

In [14]:
def save_results(texts):
    """Save preprocessed texts in a json file"""
    json_string = json.dumps(texts, ensure_ascii=False, indent=2)
    hash = hashlib.sha1(json_string.encode("utf-8")).hexdigest()
    output_file_name = f"output_{hash}.json"
    with open(output_file_name, "w", encoding="utf-8") as output_file:
        print(json_string, end="", file=output_file)
        output_file.close()
        if IN_COLAB:
            try:
                files.download(output_file_name)
                print(f"️{char_success} Downloaded preprocessed texts to file {output_file_name}")
            except:
                print(f"️{char_failiure} Downloading preprocessed texts failed!")
        else:
            print(f"️{char_success} Saved preprocessed texts to file {output_file_name}")

Next, we call the function to save the texts and their metadata

In [15]:
save_results(texts_preprocessed)

️✅ Saved preprocessed texts to file output_11f98441067263d80ee1a6bac27babf0f2c6734b.json


## 5. Alternatives for reading text data

Here are some methods for reading texts as alternatives for reading them from a museum website as presented in chapter 2 of this notebook. Run these code blocks instead of the code blocks of chapter 2 and then proceed with chapter 3.

### 5.1. Text examples defined in the code

We use descriptions from three famous artworks from Wikipedia, written in other languages than English.

First, we define the texts. We include identifiers in their descriptions

In [16]:
texts = [{"id": 1, "data_source": "wikipedia", "text_original": """De Nachtwacht is een schuttersstuk van de 
           Hollandse schilder Rembrandt van Rijn (1606-1669) dat in 1642 gereed kwam. De huidige officiële 
           titel luidt: Officieren en andere schutters van wijk II in Amsterdam, onder leiding van kapitein 
           Frans Banninck Cocq en luitenant Willem van Ruytenburch, bekend als ‘De Nachtwacht’."""},
         {"id": 2, "data_source": "wikipedia", "text_original": """La Gioconda, nota anche come Monna Lisa, 
           è un dipinto a olio su tavola di pioppo realizzato da Leonardo da Vinci (77 × 53 cm e 13 mm di 
           spessore), databile al 1503-1506 circa e conservato nel Museo del Louvre di Parigi col numero 779 
           di catalogo."""},
         {"id": 3, "data_source": "wikipedia", "text_original": """Le Penseur (initialement intitulé Le Poète) 
           est un des chefs-d'œuvre emblématiques d'Auguste Rodin."""}]

Next, we check if the definition worked and display the first text

In [17]:
if len(texts) <= 0:
    print(f"{char_failiure} No texts found!")
else:
    print("Text found:", texts[0])

Text found: {'id': 1, 'data_source': 'wikipedia', 'text_original': 'De Nachtwacht is een schuttersstuk van de \n           Hollandse schilder Rembrandt van Rijn (1606-1669) dat in 1642 gereed kwam. De huidige officiële \n           titel luidt: Officieren en andere schutters van wijk II in Amsterdam, onder leiding van kapitein \n           Frans Banninck Cocq en luitenant Willem van Ruytenburch, bekend als ‘De Nachtwacht’.'}


### 5.2. Read a single text from a text file

We use a description of a monument from Wikipedia, written in a different language than English

We start with defining a function for reading the text from a file

In [18]:
file_name = "wikipedia.txt"
data_source = "wikipedia"


def read_wikipedia_file(file_name):
    """Read a text from Wikipedia from a text file"""
    try:
        with open(file_name, "r") as infile:
            text = infile.read().strip()
            infile.close()
        return [{"id": 1, "data_source": data_source, "text_original": text}]
    except:
        print(f"{char_failiure} Cannot read data from file {file_name}!")
        return []

Next, we call the function. This requires that a file with the specified file name in present

In [19]:
texts = read_wikipedia_file(file_name)
if len(texts) <= 0:
    print(f"{char_failiure} No texts found in file {file_name}!")
else:
    print("Text found:", texts[0])

Text found: {'id': 1, 'data_source': 'wikipedia', 'text_original': 'Der Kölner Dom (offiziell Hohe Domkirche zu Köln) ist eine römisch-katholische Kirche in Köln unter dem Patrozinium des Apostels Petrus. Er ist die Kathedrale des Erzbistums Köln sowie Metropolitan kirche der Kirchenprovinz Köln. Hausherr ist der Dompropst. Der Kölner Dom ist eine der größten Kathedralen im gotischen Baustil. Sein Bau wurde 1248 im Auftrag von Konrad I. nach Entwurf von Meister Gerhard begonnen und 1880 im Auftrag von Friedrich Wilhelm IV. nach Entwurf von Ernst Friedrich Zwirner vollendet. Einige Kunsthistoriker haben den Dom wegen seiner einheitlichen und ausgewogenen Bauform als „vollkommene Kathedrale“ bezeichnet. Mit 157,22 Metern ist er nach dem Ulmer Münster der zweithöchste Sakralbau Deutschlands und hinter der Basilika Notre-Dame-de-la-Paix de Yamoussoukro die dritthöchste Kirche der Welt.'}


### 5.3. Download text from museum website

We use the description of a painting by Claude Monet from the artwork Cleveland Museum website as example text

First, we define a function for reading the texts from the website

In [20]:
base_url = "https://openaccess-api.clevelandart.org/api/artworks"
data_source = "CMA"


def fetch_cma(query_string: str) -> List[Dict[str, Any]]:
    """Fetch metadata from Cleveland Museum of Art website"""
    try:
        response = requests.get(base_url, params={"q": query_string, "skip": 0, "limit": 100}, timeout=10)
    except:
        print(f"{char_failiure} Cannot download data from {base_url}")
        return []
    response.raise_for_status()
    artworks = response.json().get("data", [])
    if artworks == []:
        return []
    else:
        return [{"id": artworks[0].get("id"), 
                 "data_source": data_source,
                 "text_original": (artworks[0].get("description") or "").strip()}]

Next, we call the function and store it

In [21]:
texts = fetch_cma("monet")
if not texts:
    print("{char_failiure} No texts were found. Are you connected to the internet?")
else:
    print("Text found:", texts[0])

Text found: {'id': 136510, 'data_source': 'CMA', 'text_original': 'A skilled horticulturalist as well as an artist, Claude Monet spent the last 30 years of his life painting the private garden he designed and helped cultivate at his home in Giverny in northern France. The resultant canvases are notable for their varied motifs, formats, and sizes. Monumental in scale, this rendering of his water lily pond focuses on the momentary effects of sunlight as it both penetrates and reflects off its shimmering surface. By zeroing in on the water and omitting its horizon and surrounding banks, Monet infers a limitless expanse—a perception amplified by the painting’s vast horizontal format that fills the viewer’s field of vision.'}
