# Application of notebooks in digital twins

This use case demonstrates how notebooks can document data relevant to digital twin models and provide worked examples of its use.

The approach follows the workflow shown in the figure below.

<img src="imgs/workflow.png" width="100%">

### Available Datasets 

This section describes existing collections of maps made available by relevant institutions.

For instance, [KU Leuven Libraries](https://bib.kuleuven.be/english/heritage/heritagecollections/types-of-material/maps-and-atlases) hosts more than 20,000 maps and atlases with a focus on Belgian territory from the 16th century to the present. These materials show us how the landscape used to look and thus offer essential information for geographers and historians. They provide [documentation](https://bib.kuleuven.be/english/heritage/how-to-search/how-to-search-maps) to show how to access and search the maps.

<img width="30%" src="https://bib.kuleuven.be/bijzondere-collecties/images/erfgoed_oude_drukken/r3a-19840-000332.jpg/@@images/image-400-aaaa9986e11af72ae305e9b4e829de13.jpeg">

The [Royal Danish Library](https://www.kb.dk/en/find-materials/collections/map-collection) also provides access to a collection of maps. The oldest maps in the collection date from the 16th century. For example, by using the [following link](https://soeg.kb.dk/discovery/search?query=any,contains,danmark&tab=Everything&search_scope=MyInst_and_CI&vid=45KBDK_KGL:KGL&facet=rtype,include,maps&lang=en&offset=10&came_from=pagination_1_2), we can retrieve maps related to Denmark. The following image shows an example.

<img width="50%" src="https://kb-images.kb.dk/DAMJP2/DAM/Maps/0000/069/459/DK003600/full/full/0/native.jpg">

The [National Library of Spain](https://bnedigital.bne.es/bd/es/results?y=s&o=&o=&w=mapa&w=&f=ficha&f=texto_ficha&g=ws&f4=Material+cartogr%C3%A1fico+manuscrito) provides access to a collection of maps, including textual metadata, that can be exported as a txt file. The [following link](https://bnedigital.bne.es/bd/es/export?y=s&o=&o=&w=mapa&w=&f=ficha&f=texto_ficha&g=ws&f4=Material+cartogr%C3%A1fico+manuscrito&x=adefadbf-b10b-4a34-a0d7-98513056a7b3) provides access to the metadata of the collection extracted as textual documentation. Note that most of the records are provided under a CC0 licence.

<img width="30%" src="https://bnedigital.bne.es/bd/es/medium?id=d5a51609-f45a-48f0-836e-1bffe32430f7">

Below is a sample of the metadata provided by the National Library of Spain. As the record shows, the metadata is limited to the title, authors, dates and some notes.

```
Registro 1

    Título:              [Mapa itinerario de Guipúzcoa]

    Tipo de documento:   Material cartográfico manuscrito

    Autoría:             

    Fecha:               [18­-]

    Materia:             

    Descripción física:  1 mapa : ms., col.

    Signatura:           MR/42/471

    MMS ID:              991000586059708606

    Identificador corto: 0174194456

    CDU:                 (466.2)

    URL:                 https://bnedigital.bne.es/bd/es/card?id=2d3c9301-a1d1-45de-ae3c-8d7c5c02221c
------------------------------------------------------------------------------------------
```

## Example: the National Library of Spain

### Retrieve Metadata

This step retrieves the metadata from the National Library of Spain. The web interface was used to search for records typed as cartographic material. The export link was then used to download the metadata, which comes as human-readable plain text.

In [60]:
url = 'https://bnedigital.bne.es/bd/es/export?o=&o=o&o=n&o=&o=o&o=n&w=&w=&w=&w=&w=&w=&f=ficha&f=ficha&f=ficha&f=ficha&f=ficha&f=ficha&p=&f4=Material+cartogr%C3%A1fico+manuscrito&g=ws&g=dd&g=ld&g=pd&g=pg&g=hh&g=fa&d=date&d=&d=&startYear=&endYear=&year=&l=10&x=adefadbf-b10b-4a34-a0d7-98513056a7b3'

In [61]:
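import os
import requests

resp = requests.get(url=url, timeout=60)
resp.raise_for_status()  # stop here if the export request failed
#print (resp.text) ## uncomment to see the content

# Make sure the target folder exists, then save the export as UTF-8 text
os.makedirs("data/maps-bne", exist_ok=True)
with open("data/maps-bne/metadata-bne.txt", "w", encoding="utf-8") as text_file:
    text_file.write(resp.text)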
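
The export link encodes the search as ordinary query parameters, so it can be inspected before any request is sent. The following minimal sketch uses only the standard library to decode the shorter export link quoted in the dataset overview. What each parameter means is not described in this document, so any interpretation (for example, that `f4` carries the material type filter) is an assumption based on its value.

```python
from urllib.parse import urlsplit, parse_qs

# Export link for manuscript cartographic material, as quoted in the dataset overview.
export_url = (
    "https://bnedigital.bne.es/bd/es/export?y=s&o=&o=&w=mapa&w=&f=ficha"
    "&f=texto_ficha&g=ws&f4=Material+cartogr%C3%A1fico+manuscrito"
    "&x=adefadbf-b10b-4a34-a0d7-98513056a7b3"
)

# parse_qs decodes percent-escapes and '+' signs; empty values are dropped by default.
params = parse_qs(urlsplit(export_url).query)

for name, values in params.items():
    print(f"{name}: {values}")
```

Decoding the link this way makes it easier to see which part of the URL would need to change when exporting a different material type or search term.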

#### Extraction

Now we transform the human-readable text to a CSV so the data can be analysed and adapted easily. For this, we create a CSV file, each row containing a record.

In [62]:
import csv

txtfile = "data/maps-bne/metadata-bne.txt"

def field_value(line):
    # Text after the first colon, without padding or the trailing newline.
    # Splitting only once keeps values that contain ": " themselves intact.
    return line.split(":", 1)[1].strip()

with open('data/maps-bne/metadata-bne.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    field = ["title", "type", "author", "date", "subject", "description", "id", "url"]
    writer.writerow(field)

    title = author = date = subject = description = record_id = record_url = ''

    with open(txtfile, encoding='utf-8') as file: # loop over all lines and collect the metadata fields
        for line in file:
            if "Identificador corto" in line:
                record_id = field_value(line)
            elif "Título:" in line:
                title = field_value(line)
            elif "Autoría" in line:
                author = field_value(line)
            elif "Descripción física" in line:
                description = field_value(line)
            elif "Materia:" in line:
                subject = field_value(line)
            elif "Fecha:" in line:
                date = field_value(line)
            elif "URL:" in line: # last field of each record
                record_url = field_value(line)
                writer.writerow([title, "map", author, date, subject, description, record_id, record_url])
                # reset the fields before reading the next record
                title = author = date = subject = description = record_id = record_url = ''
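
To check the parsing logic without downloading the full export, the same idea can be tried on a single record. The following is a minimal sketch, assuming the export keeps the `label: value` layout of the sample record shown earlier (abridged here); the `LABELS` mapping and the `parse_records` helper are illustrative names, not part of the original notebook.

```python
# Abridged copy of the sample record from the export.
SAMPLE = """\
Registro 1

    Título:              [Mapa itinerario de Guipúzcoa]

    Tipo de documento:   Material cartográfico manuscrito

    Autoría:

    Descripción física:  1 mapa : ms., col.

    Identificador corto: 0174194456

    URL:                 https://bnedigital.bne.es/bd/es/card?id=2d3c9301-a1d1-45de-ae3c-8d7c5c02221c
------------------------------------------------------------------------------------------
"""

# Export labels mapped to the CSV column names used in this notebook.
LABELS = {
    "Título": "title",
    "Autoría": "author",
    "Fecha": "date",
    "Materia": "subject",
    "Descripción física": "description",
    "Identificador corto": "id",
    "URL": "url",
}

def parse_records(text):
    """Yield one dict per record; a record is complete at its URL line."""
    record = dict.fromkeys(LABELS.values(), "")
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # "1 mapa : ms., col." or URLs are kept whole.
        label, sep, value = line.partition(":")
        column = LABELS.get(label.strip())
        if not sep or column is None:
            continue
        record[column] = value.strip()
        if column == "url":
            yield record
            record = dict.fromkeys(LABELS.values(), "")

records = list(parse_records(SAMPLE))
print(records[0])
```

Unlike the substring checks in the cell above, this sketch matches labels exactly, which avoids accidental matches when a value happens to contain a label name.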

### Extract text from images with OCR

For the text extraction we use an LLM-based approach, running vision models through Ollama. We will also test additional models such as LLaVA, DeepSeek, Mistral and Qwen.

In [None]:
!pip install ollama-ocr

https://bnedigital.bne.es/bd/card?oid=0000000735
https://bnedigital.bne.es/bd/card?oid=0000001402

In [None]:
from ollama_ocr import OCRProcessor

# Initialize the OCR processor with one vision model available on Ollama.
# Alternatives to try: 'llama3.2-vision:11b', 'llava:7b'
ocr = OCRProcessor(model_name='deepseek-ocr')

# Process an image
result = ocr.process_image(
    image_path="imgs/Isle-Saint-Domingue.jpg",
    format_type="markdown"  # Options: markdown, text, json, structured, key_value
)
print(result)

### Transform the metadata to RDF

In this step we transform the human-readable metadata into a machine-readable form by using the [Resource Description Framework (RDF)](https://www.w3.org/TR/rdf12-concepts/), a W3C standard for publishing data as triples.

Following previous work, the Python library RDFLib is used to create the RDF data, with Schema.org as the main vocabulary.

Note that as an example we use the National Library of Spain. Additional datasets could be used following a similar approach, in order to create a larger dataset with data integrated from several institutions.
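
To make the triple structure concrete before introducing RDFLib, here is a minimal sketch that writes two statements in N-Triples syntax using only the standard library. The helper names (`iri`, `literal`, `triple`) are illustrative, the escaping covers only the most common characters, and the record URI uses the `example.org` placeholder domain; in the notebook itself, RDFLib takes care of all of this.

```python
def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"

def literal(value):
    """Serialise a plain string as an N-Triples literal with basic escaping."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'

def triple(subject, predicate, obj):
    """Join three already-serialised terms into one N-Triples statement."""
    return f"{subject} {predicate} {obj} ."

# Hypothetical record URI under the example.org placeholder domain.
record = iri("https://example.org/record/0174194456")

statements = [
    # subject - predicate - object: the record is a schema:CreativeWork
    triple(
        record,
        iri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
        iri("https://schema.org/CreativeWork"),
    ),
    # subject - predicate - object: the record has a name (its title)
    triple(
        record,
        iri("https://schema.org/name"),
        literal("[Mapa itinerario de Guipúzcoa]"),
    ),
]

print("\n".join(statements))
```

Each line of the output is one triple: a subject and a predicate, both IRIs, followed by an object that is either an IRI or a literal.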

In [15]:
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS, DCTERMS, VOID, DC, SKOS, OWL, XSD
import datetime

g = Graph()
g.bind("foaf", FOAF)
g.bind("rdf", RDF)
g.bind("rdfs", RDFS)
g.bind("dcterms", DCTERMS)
g.bind("dc", DC)
g.bind("void", VOID)
g.bind("skos", SKOS)
g.bind("owl", OWL)

schema = Namespace("https://schema.org/")
g.bind("schema", schema)

dcat = Namespace("http://www.w3.org/ns/dcat#")
g.bind("dcat", dcat)

wd = Namespace("http://www.wikidata.org/entity/")
g.bind("wd", wd)

domain = 'https://example.org/'
domainLanguage = domain + 'map/'

##### We add the metadata of the dataset

In [16]:
# First, we create all the required URIs
catalog = URIRef(domain + "catalog/digital-twins")
bne = URIRef(domain + "organization/bneXXXXXXXXXXXXXXXXXXXX")
dataset = URIRef(domain + "dataset/digital-twins")
digital_twins_csv = URIRef(domain + "distribution/digital-twins-csv")
digital_twins_ttl = URIRef(domain + "distribution/digital-twins-ttl")
digital_twins_txt = URIRef(domain + "distribution/digital-twins-txt")

In [17]:
# We describe the catalog
g.add((catalog, RDF.type, schema.Dataset))
g.add((catalog, RDF.type, dcat.Catalog))
g.add((catalog, RDFS.label, Literal("Maps from the National Library of Spain")))
g.add((catalog, schema.url, URIRef("https://bnedigital.bne.es/")))
g.add((catalog, FOAF.homepage, URIRef("https://bnedigital.bne.es/")))
g.add((catalog, schema.description, Literal("Maps from the National Library of Spain")))
g.add((catalog, schema.name, Literal("Maps from the National Library of Spain")))
g.add((catalog, DCTERMS.title, Literal("Maps from the National Library of Spain")))
g.add((catalog, DCTERMS.publisher, URIRef(bne))) # relation dataset-publisher
g.add((catalog, DC.title, Literal("National Library of Spain")))
g.add((catalog, schema.license, URIRef('https://creativecommons.org/publicdomain/zero/1.0/')))

now = datetime.datetime.now()
g.add((catalog, schema.dateCreated, Literal(str(now)[:10])))

In [18]:
# We describe the BNE
g.add((bne, RDF.type, FOAF.Organization))
g.add((bne, RDFS.label, Literal("National Library of Spain")))
g.add((bne, FOAF.homepage, URIRef("https://bnedigital.bne.es/")))

In [19]:
# We describe the dataset
g.add((dataset, RDF.type, dcat.Dataset))
g.add((dataset, DCTERMS.title, Literal("Maps from the National Library of Spain", lang="en")))
g.add((dataset, dcat.keyword, Literal("Maps")))
g.add((dataset, dcat.keyword, Literal("Cartographic material")))
g.add((dataset, dcat.keyword, Literal("Collections as data")))
g.add((dataset, DCTERMS.issued, Literal(str(now)[:10])))
g.add((dataset, DCTERMS.language, URIRef("http://id.loc.gov/vocabulary/iso639-1/es")))
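g.add((dataset, dcat.distribution, URIRef(digital_twins_csv)))
g.add((dataset, dcat.distribution, URIRef(digital_twins_ttl)))
g.add((dataset, dcat.distribution, URIRef(digital_twins_txt)))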

In [20]:
# We describe the CSV, TTL and TXT distributions
g.add((digital_twins_csv, RDF.type, dcat.Distribution))
g.add((digital_twins_csv, dcat.downloadURL , URIRef("https://raw.githubusercontent.com/hibernator11/eccch-use-cases/refs/heads/main/data/maps-bne/metadata-bne.csv")))
g.add((digital_twins_csv, DCTERMS.title, Literal("CSV distribution of BNE cartographic material", lang="en")))
g.add((digital_twins_csv, DCTERMS.title, Literal("Distribución en CSV del conjunto de datos de mapas de la BNE", lang="es")))
g.add((digital_twins_csv, dcat.mediaType, URIRef("http://www.iana.org/assignments/media-types/text/csv")))
g.add((digital_twins_csv, dcat.byteSize, Literal('300000', datatype=XSD.integer)))

g.add((digital_twins_ttl, RDF.type, dcat.Distribution))
g.add((digital_twins_ttl, dcat.downloadURL , URIRef("https://raw.githubusercontent.com/hibernator11/eccch-use-cases/refs/heads/main/data/maps-bne/dataset_bne.ttl")))
g.add((digital_twins_ttl, DCTERMS.title, Literal("TTL distribution of BNE cartographic material", lang="en")))
g.add((digital_twins_ttl, DCTERMS.title, Literal("Distribución en TTL del conjunto de datos de mapas de la BNE", lang="es")))
g.add((digital_twins_ttl, dcat.mediaType, URIRef("http://www.iana.org/assignments/media-types/text/turtle")))
g.add((digital_twins_ttl, dcat.byteSize, Literal('528000', datatype=XSD.integer)))

g.add((digital_twins_txt, RDF.type, dcat.Distribution))
g.add((digital_twins_txt, dcat.downloadURL , URIRef("https://raw.githubusercontent.com/hibernator11/eccch-use-cases/refs/heads/main/data/maps-bne/metadata-bne.txt")))
g.add((digital_twins_txt, DCTERMS.title, Literal("TXT distribution of BNE cartographic material", lang="en")))
g.add((digital_twins_txt, DCTERMS.title, Literal("Distribución en TXT del conjunto de datos de mapas de la BNE", lang="es")))
g.add((digital_twins_txt, dcat.mediaType, URIRef("https://www.iana.org/assignments/media-types/text/plain")))
g.add((digital_twins_txt, dcat.byteSize, Literal('671000', datatype=XSD.integer)))

#### Read the CSV file and transform the records
Now we transform the records using several classes and properties from different vocabularies and ontologies.

In [None]:
import csv

with open('data/maps-bne/metadata-bne.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        #print(row['id'], row['type'])

        classtype = 'https://schema.org/CreativeWork'

        # One URI per record, built from its short identifier
        record = URIRef(domain + "record/" + str(row["id"]).strip())
        g.add((record, RDF.type, URIRef(classtype)))
        g.add((record, schema.sourceOrganization, Literal("National Library of Spain")))
        # Each record belongs to the dataset described above
        g.add((record, schema.isPartOf, dataset))
        g.add((record, schema.description, Literal(row['description'].strip())))
        if row['author'].strip() != "":
            g.add((record, schema.author, Literal(row['author'].strip())))
        g.add((record, schema.identifier, Literal(row['id'].strip())))
        g.add((record, schema.datePublished, Literal(row['date'].strip())))
        g.add((record, schema.name, Literal(row['title'].strip())))
        g.add((record, schema.url, URIRef(row['url'].strip())))
        g.add((record, schema.license, URIRef('https://creativecommons.org/publicdomain/zero/1.0/')))

# Write the whole graph as Turtle
g.serialize(destination="data/maps-bne/dataset_bne.ttl", format="turtle")

#### Now we can query the graph using SPARQL

In [None]:
print('##### Properties:')

# Query the data in g using SPARQL
q = """
    SELECT distinct ?prop
    WHERE {
        ?s ?prop ?o .
    }
"""

# Apply the query to the graph and iterate through results
for r in g.query(q):
    print(r["prop"])

As an example, we can retrieve the metadata of our dataset:

In [None]:
print('##### Dataset information:')

# Query the data in g using SPARQL
q = """
    PREFIX schema: <https://schema.org/>
    SELECT distinct ?p ?o
    WHERE {
        ?s a schema:Dataset .
        ?s ?p ?o
    }
"""

# Apply the query to the graph and iterate through results
for r in g.query(q):
    # Format as plain text instead of concatenating RDF terms
    print(f'{r["p"]}: {r["o"]}')

### Integration with the ECCCH

[ECHOES](https://www.echoes-eccch.eu/faq/) is building a federated Knowledge Graph to allow high-level integration of resources. It will also serve as the entry point for queries and requests related to any kind of information available within the Cultural Heritage Cloud. The Knowledge Graph will use the proposed Heritage Digital Twin Ontology (HDTO) to unify descriptions and facilitate querying and navigation. The current version of the ECHOES HDTO is available [here](https://www.echoes-eccch.eu/wp-content/uploads/2025/06/ECHOES_HDT_Ontology.pdf). The main vocabulary employed to describe the resources is [CIDOC-CRM](https://cidoc-crm.org/).

The following illustration shows how we modelled the outputs of this work in order to be integrated into the ECCCH.
<img width="80%" src="imgs/eccch-integration-steps.png">

In [1]:
## To be done!
# Integration with the ECCCH once the API and final data model are available

### Publication & dissemination

This step involves publishing the results, including the dataset and this notebook, on platforms such as the Social Sciences and Humanities Open Marketplace and Zenodo.

As an example, we use the Zenodo sandbox service. To use this code in production, replace the sandbox URL with the production one. First, create an access token via this [link](https://zenodo.org/account/settings/applications/tokens/new/). Note that the sandbox requires its own token, created on the sandbox site.
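
Because the sandbox and production services differ only in their host name, the switch can be isolated in one place. The following is a minimal sketch under that assumption; the helper names `zenodo_api` and `auth_headers` and the `ZENODO_TOKEN` environment variable are hypothetical and not part of the Zenodo API.

```python
import os

def zenodo_api(sandbox=True):
    """Return the API root for the sandbox or the production service."""
    host = "sandbox.zenodo.org" if sandbox else "zenodo.org"
    return f"https://{host}/api"

def auth_headers(token):
    """Build the bearer-token header used by the requests in this section."""
    return {"Authorization": f"Bearer {token}"}

# ZENODO_TOKEN is a hypothetical variable name; reading the token from the
# environment keeps it out of the notebook. 'ChangeMe' is only a placeholder.
token = os.environ.get("ZENODO_TOKEN", "ChangeMe")

deposit_url = f"{zenodo_api(sandbox=True)}/deposit/depositions"
print(deposit_url)
```

With this in place, moving from testing to production would mean changing `sandbox=True` to `sandbox=False` and supplying a production token.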

In [43]:
# https://developers.zenodo.org/
import requests
ACCESS_TOKEN = 'ChangeMe'

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {ACCESS_TOKEN}"
}
r = requests.post('https://sandbox.zenodo.org/api/deposit/depositions',
                   json={},
                   headers=headers)
r.status_code
r.json()

{'created': '2026-01-12T17:41:07.632652+00:00',
 'modified': '2026-01-12T17:41:07.740296+00:00',
 'id': 424692,
 'conceptrecid': '424691',
 'metadata': {'access_right': 'open',
  'prereserve_doi': {'doi': '10.5281/zenodo.424692', 'recid': 424692}},
 'title': '',
 'links': {'self': 'https://sandbox.zenodo.org/api/deposit/depositions/424692',
  'html': 'https://sandbox.zenodo.org/deposit/424692',
  'badge': 'https://sandbox.zenodo.org/badge/doi/.svg',
  'files': 'https://sandbox.zenodo.org/api/deposit/depositions/424692/files',
  'bucket': 'https://sandbox.zenodo.org/api/files/a40e829f-dd24-4e7b-b228-7c4e048a270c',
  'latest_draft': 'https://sandbox.zenodo.org/api/deposit/depositions/424692',
  'latest_draft_html': 'https://sandbox.zenodo.org/deposit/424692',
  'publish': 'https://sandbox.zenodo.org/api/deposit/depositions/424692/actions/publish',
  'edit': 'https://sandbox.zenodo.org/api/deposit/depositions/424692/actions/edit',
  'discard': 'https://sandbox.zenodo.org/api/deposit/depos

Now, let’s upload a new file:

In [44]:
bucket_url = r.json()["links"]["bucket"]
deposition_id = r.json()["id"]

First, we create a zip file with the notebook and the requirements.txt file:

In [45]:
from zipfile import ZipFile

# List of files to include in the archive
file_list = ["Digital-Twins.ipynb", "requirements.txt"]

# Create ZIP file and write files into it
with ZipFile("output.zip", "w") as zipf:
   for file in file_list:
      zipf.write(file)

Then, we call the API:

In [47]:
filename = "output.zip"
path = filename
headers = {'Authorization': f'Bearer {ACCESS_TOKEN}'}

# The target URL is the bucket link followed by the desired filename,
# separated by a slash.
with open(path, "rb") as fp:
    r = requests.put(
        "%s/%s" % (bucket_url, filename),
        data=fp,
        headers=headers,
    )
r.json()

{'created': '2026-01-12T17:41:16.912746+00:00',
 'updated': '2026-01-12T17:41:17.052565+00:00',
 'version_id': 'f8cd6099-25aa-4105-aa41-91921a295466',
 'key': 'output.zip',
 'size': 28194,
 'mimetype': 'application/zip',
 'checksum': 'md5:4df29214004c115c3e7a0ab4a433da97',
 'is_head': True,
 'delete_marker': False,
 'links': {'self': 'https://sandbox.zenodo.org/api/files/a40e829f-dd24-4e7b-b228-7c4e048a270c/output.zip',
  'version': 'https://sandbox.zenodo.org/api/files/a40e829f-dd24-4e7b-b228-7c4e048a270c/output.zip?version_id=f8cd6099-25aa-4105-aa41-91921a295466',
  'uploads': 'https://sandbox.zenodo.org/api/files/a40e829f-dd24-4e7b-b228-7c4e048a270c/output.zip?uploads=1'}}
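
The upload response includes a `checksum` field of the form `md5:<hex digest>`, which can be used to confirm that the file arrived intact. The following is a minimal sketch of that check using the standard library; `md5_checksum` and `upload_matches` are illustrative helper names.

```python
import hashlib

def md5_checksum(path, chunk_size=1 << 20):
    """Return 'md5:<hex digest>' for a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return f"md5:{digest.hexdigest()}"

def upload_matches(path, response_json):
    """Compare the local checksum with the 'checksum' field of the upload response."""
    return md5_checksum(path) == response_json.get("checksum")
```

After the upload call, `upload_matches("output.zip", r.json())` should return `True` if the local file and the stored file are identical.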

We can also add metadata to the record:

In [49]:
import json

data = {
    'metadata': {
        'title': 'Using and reusing notebooks in high-performance computing environments',
        'upload_type': 'software',
        'description': 'This use case shows how to reuse a collection of maps made available by the National Library of Spain following a set of steps in the form of a reproducible workflow: extraction, OCR analysis using LLMs, metadata generation and dissemination',
        'creators': [{'name': 'Candela, Gustavo',
                      'affiliation': 'University of Alicante'}]
    }
}
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {ACCESS_TOKEN}'
}
# Update the draft deposition with the metadata above
r = requests.put('https://sandbox.zenodo.org/api/deposit/depositions/%s' % deposition_id,
                  data=json.dumps(data),
                  headers=headers)
r.status_code

200

The last step is the publication:

In [50]:
headers = {'Authorization': f'Bearer {ACCESS_TOKEN}'}
r = requests.post('https://sandbox.zenodo.org/api/deposit/depositions/%s/actions/publish' % deposition_id,
                      headers=headers)
r.status_code
# 202

202

And now we can see the result in Zenodo:

<img src="imgs/zenodo-publication.png" width="70%">

We can repeat the same process on other platforms such as [Wikidata](https://www.wikidata.org/) and the [Social Sciences and Humanities Open Marketplace](https://marketplace.sshopencloud.eu/about/api-documentation).

In the particular case of Wikidata, existing [Python libraries](https://www.mediawiki.org/wiki/Manual:Pywikibot/Wikidata) can be used to extract and create entities.

### References

- Candela, G., Rosiński, C., & Margraf, A. (2025). A reproducible framework to publish and reuse Collections as data: the case of the European Literary Bibliography (Version 4, Vol. 965, Issue 170). Transformations: A DARIAH Journal. https://doi.org/10.46298/transformations.14729
- Gustavo Candela, Javier Pereda, Dolores Sáez, Pilar Escobar, Alexander Sánchez, Andrés Villa Torres, Albert A. Palacios, Kelly McDonough, and Patricia Murrieta-Flores. 2023. An Ontological Approach for Unlocking the Colonial Archive. J. Comput. Cult. Herit. 16, 4, Article 74 (December 2023), 18 pages. https://doi.org/10.1145/3594727
- Niccolucci F, Markhoff B, Theodoridou M et al. The Heritage Digital Twin: a bicycle made for two. The integration of digital methodologies into cultural heritage research [version 1; peer review: 2 approved with reservations]. Open Res Europe 2023, 3:64 (https://doi.org/10.12688/openreseurope.15496.1)
- https://developers.zenodo.org/#quickstart-upload
- https://www.echoes-eccch.eu/wp-content/uploads/2025/06/ECHOES_HDT_Ontology.pdf
- https://marketplace.sshopencloud.eu/about/api-documentation
- https://www.wikidata.org/
- https://cidoc-crm.org/sites/default/files/CRMdigv4.0.pdf
- https://arxiv.org/html/2503.02167v1