# AI preservation

This use case will explore how to approach the long-term preservation of AI models which are contributing to notebooks. AI preservation in general is still a largely unaddressed area of work, despite its rapid integration into digital preservation tools and workflows. Its use across domains of research makes it critical to the reproducibility of results, and its proper long-term management an essential condition for the future robustness of ECCCH.

This approach employs a workflow as shown in the following picture.

<img src="imgs/workflow.png" width="100%">

### The following sections describe a selection of examples of how we could address AI preservation from different perspectives.

#### Automatic generation of Model Cards and datasheets

Model cards are files that are provided as part of models including provenance documentation such as how they were trained. Model cards are simple Markdown files with additional metadata. Model cards are essential for discoverability, reproducibilityHugging Face promotes its use  [https://huggingface.co/docs/hub/model-cards]. Previous work have addressed the [automatic generation of model cards](https://github.com/jiarui-liu/AutomatedModelCardGeneration) using LLMs and the text from the article and the code and documentation provided.

A more advanced and detailed approach is [datasheets for cultural heritage datasets](https://pro.europeana.eu/project/datasheets-for-digital-cultural-heritage-working-group, an iniciative from Europeana to describe datasets and AI outputs. Datasheets include a structure to describe datasets that cover a wide diversity of categories such as authorship, provenance, biases, examples of use, contact, etc. The creation of datasheets from scratch requires a certain amount of time that could be leveraged by using LLM to create a preliminary draft. 

In [23]:
## TODO LLM training and assessment to generate datasheets

#### Storage of AI model and data

By storing and preserving metadata, the AI model and the data used to train the model potential future users are enabled to reproduce the same results, using the same version of software libraries. An analysis is required in order to identify the amount of data required to store as well as the envelope or package in which all the data will be stored. Approaches from other fields will be analysed such as Web Archive and software development.  

In [24]:
## TODO development of a reusable and standard package to enable AI preservation

### Integration with the ECCCH

[ECHOES](https://www.echoes-eccch.eu/faq/) is building a federated Knowledge Graph to allow for high level integration of resources. It will also serve as an entry point for all queries and requests related to any kind of information available within the Cultural heritage Cloud. The Knowledge Graph will use the proposed Heritage Digital Twin Ontology (HDTO) to unify descriptions and facilitate query and navigation. The current version of the ECHOES HDTO is available [here](https://www.echoes-eccch.eu/wp-content/uploads/2025/06/ECHOES_HDT_Ontology.pdf). The main vocabulary employed to describe the resources is [CIDOC-CRM](https://cidoc-crm.org/).

The following illustration shows how we modelled the outputs of this work in order to be integrated into the ECCCH.
<img width="80%" src="imgs/eccch-integration-steps.png">

And the following picture shows how we modelled the data using the vocabularies and ontologies. The class prov:Entity represents the code in the form of Jupyter Notebook which was generated by means of a prov:Activity, describing the work in this code, using a distribution of a dataset as an input (txt), and generating another distribution (ttl). The distributions are part of a dataset, which is also part of a catalog that was published by an organization.

<img width="80%" src="imgs/data-model-cidoc.png">

In [15]:
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS, DCTERMS, VOID, DC, SKOS, OWL, XSD
import datetime

g = Graph()
g.bind("foaf", FOAF)
g.bind("rdf", RDF)
g.bind("rdfs", RDFS)
g.bind("dcterms", DCTERMS)
g.bind("dc", DC)
g.bind("void", VOID)
g.bind("skos", SKOS)
g.bind("owl", OWL)

schema = Namespace("https://schema.org/")
g.bind("schema", schema)

dcat = Namespace("http://www.w3.org/ns/dcat#")
g.bind("dcat", dcat)

wd = Namespace("http://www.wikidata.org/entity/")
g.bind("wd", wd)

cidoc_crm = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
g.bind("cidoc-crm", cidoc_crm)

prov = Namespace("http://www.w3.org/ns/prov#")
g.bind("prov", prov)

domain = 'https://example.org/'
domainLanguage = domain + 'map/'

##### We add the metadata of the dataset

In [16]:
# First, we create all the required URIS
ai_catalog = URIRef(domain + "catalog/ai")
ai_org = URIRef(domain + "organization/ai")
ai_dataset = URIRef(domain + "dataset/ai")
ai_ttl = URIRef(domain + "distribution/ai-ttl")

In [17]:
# We describe the dataset
g.add((ai_catalog, RDF.type, schema.Dataset))
g.add((ai_catalog, RDF.type, dcat.catalog))
g.add((ai_catalog, RDFS.label, Literal("AI preservation examples")))
g.add((ai_catalog, schema.description, Literal("This use case will explore how to approach the long-term preservation of AI models which are contributing to notebooks.")))
g.add((ai_catalog, schema.name, Literal("This use case will explore how to approach the long-term preservation of AI models which are contributing to notebooks.")))
g.add((ai_catalog, DCTERMS.title, Literal("This use case will explore how to approach the long-term preservation of AI models which are contributing to notebooks.")))
g.add((ai_catalog, DCTERMS.publisher, URIRef(ai_org))) # relation dataset-publisher
g.add((ai_catalog, DC.title, Literal("OpenAIRE")))
g.add((ai_catalog, schema.license, URIRef('https://creativecommons.org/licenses/by/4.0/')))
g.add((ai_catalog, dcat.dataset, ai_dataset))

now = datetime.datetime.now()
g.add((ai_catalog, schema.dateCreated, Literal(str(now)[:10])))

<Graph identifier=Nf22eb6a8cb8541fbb1a0ce33e1047c72 (<class 'rdflib.graph.Graph'>)>

In [18]:
# We describe a working group from glamlabs.io
g.add((ai_org, RDF.type, FOAF.Organization))
g.add((ai_org, RDFS.label, Literal("AI preservation group")))
g.add((ai_org, FOAF.homepage, URIRef("https://glamlabs.io/")))

<Graph identifier=Nf22eb6a8cb8541fbb1a0ce33e1047c72 (<class 'rdflib.graph.Graph'>)>

In [19]:
# We describe the dataset
g.add((ai_dataset, RDF.type, dcat.Dataset))
g.add((ai_dataset, DCTERMS.title, Literal("LLM model to generate model cards and datasheets for AI models", lang="en")))
g.add((ai_dataset, dcat.keyword, Literal("Research products")))
g.add((ai_dataset, dcat.keyword, Literal("LLM")))
g.add((ai_dataset, dcat.keyword, Literal("Data")))
g.add((ai_dataset, DCTERMS.issued, Literal(str(now)[:10])))
g.add((ai_dataset, DCTERMS.language, URIRef("http://id.loc.gov/vocabulary/iso639-1/en")))
g.add((ai_dataset, dcat.distribution, URIRef(ai_ttl)))

<Graph identifier=Nf22eb6a8cb8541fbb1a0ce33e1047c72 (<class 'rdflib.graph.Graph'>)>

In [20]:
# We describe the TTL distribution 
g.add((ai_ttl, RDF.type, dcat.Distribution))
g.add((ai_ttl, dcat.downloadURL , URIRef("https://raw.githubusercontent.com/hibernator11/eccch-use-cases/refs/heads/main/data/ai/dataset_ai.ttl")))
g.add((ai_ttl, DCTERMS.title, Literal("TTL distribution of AI preservation", lang="en")))
g.add((ai_ttl, DCTERMS.title, Literal("Distribución en TTL del conjunto de datos de AI preservation", lang="es")))
g.add((ai_ttl, dcat.mediaType, URIRef("http://www.iana.org/assignments/media-types/application/n-triples")))
g.add((ai_ttl, dcat.byteSize, Literal('260000', datatype=XSD.integer)))

<Graph identifier=Nf22eb6a8cb8541fbb1a0ce33e1047c72 (<class 'rdflib.graph.Graph'>)>

### Now we link the notebooks and the distributions of the dataset created

In [21]:
preservation_work = URIRef(domain + "preservation-work/openaire")
notebooks = URIRef(domain + "notebooks/openaire")
g.add((notebooks, RDF.type, prov.Entity))
g.add((notebooks, DCTERMS.title, Literal("AI preservation", lang="en")))
g.add((notebooks, prov.wasGeneratedBy, URIRef(preservation_work)))
g.add((notebooks, dcat.mediatype, URIRef("https://www.iana.org/assignments/media-types/application/x-ipynb+json")))

now = datetime.datetime.now()
g.add((preservation_work, RDF.type, prov.Activity))
g.add((preservation_work, prov.startedAtTime, Literal(str(now)[:10])))
g.add((preservation_work, prov.generated, URIRef(ai_ttl)))
g.add((preservation_work, prov.endedAtTime, Literal(str(now)[:10])))

<Graph identifier=Nf22eb6a8cb8541fbb1a0ce33e1047c72 (<class 'rdflib.graph.Graph'>)>

#### Store the data

In [22]:
g.serialize(destination="data/ai/dataset_ai.ttl")

<Graph identifier=Nf22eb6a8cb8541fbb1a0ce33e1047c72 (<class 'rdflib.graph.Graph'>)>

#### Now we can query the graph using SPARQL

In [12]:
print('##### Properties:')

# Query the data in g using SPARQL
q = """
    SELECT distinct ?prop
    WHERE {
        ?s ?prop ?o .
    }
"""

# Apply the query to the graph and iterate through results
for r in g.query(q):
    print(r["prop"])

##### Properties:
http://www.w3.org/ns/dcat#downloadURL
http://www.w3.org/ns/dcat#byteSize
http://purl.org/dc/terms/title
https://schema.org/name
http://www.w3.org/ns/dcat#dataset
http://www.w3.org/ns/prov#wasGeneratedBy
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/ns/prov#generated
https://schema.org/license
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/ns/dcat#distribution
https://schema.org/dateCreated
http://www.w3.org/ns/prov#startedAtTime
http://www.w3.org/ns/dcat#keyword
http://purl.org/dc/terms/language
http://purl.org/dc/elements/1.1/title
http://www.w3.org/ns/dcat#mediatype
http://purl.org/dc/terms/publisher
https://schema.org/description
http://www.w3.org/ns/prov#endedAtTime
http://purl.org/dc/terms/issued
http://xmlns.com/foaf/0.1/homepage
http://www.w3.org/ns/dcat#mediaType


As an example we can retrieve the metadata of our dataset

In [13]:
print('##### Dataset information:')

# Query the data in g using SPARQL
q = """
    PREFIX schema: <https://schema.org/>
    SELECT distinct ?p ?o
    WHERE {
        ?s a schema:Dataset .
        ?s ?p ?o
    }
"""

# Apply the query to the graph and iterate through results
for r in g.query(q):
    print(r["p"] + ": " + r["o"])

http://www.w3.org/1999/02/22-rdf-syntax-ns#type:  does not look like a valid URI, trying to serialize this will break.
http://www.w3.org/1999/02/22-rdf-syntax-ns#type: https://schema.org/Dataset does not look like a valid URI, trying to serialize this will break.
http://www.w3.org/1999/02/22-rdf-syntax-ns#type: https://schema.org/Dataset does not look like a valid URI, trying to serialize this will break.
http://www.w3.org/1999/02/22-rdf-syntax-ns#type:  does not look like a valid URI, trying to serialize this will break.
http://www.w3.org/1999/02/22-rdf-syntax-ns#type: http://www.w3.org/ns/dcat#catalog does not look like a valid URI, trying to serialize this will break.
http://www.w3.org/1999/02/22-rdf-syntax-ns#type: http://www.w3.org/ns/dcat#catalog does not look like a valid URI, trying to serialize this will break.
http://www.w3.org/2000/01/rdf-schema#label:  does not look like a valid URI, trying to serialize this will break.
http://www.w3.org/2000/01/rdf-schema#label: AI preserv

##### Dataset information:
http://www.w3.org/1999/02/22-rdf-syntax-ns#type: https://schema.org/Dataset
http://www.w3.org/1999/02/22-rdf-syntax-ns#type: http://www.w3.org/ns/dcat#catalog
http://www.w3.org/2000/01/rdf-schema#label: AI preservation examples
https://schema.org/description: This use case will explore how to approach the long-term preservation of AI models which are contributing to notebooks.
https://schema.org/name: This use case will explore how to approach the long-term preservation of AI models which are contributing to notebooks.
http://purl.org/dc/terms/title: This use case will explore how to approach the long-term preservation of AI models which are contributing to notebooks.
http://purl.org/dc/terms/publisher: https://example.org/organization/ai
http://purl.org/dc/elements/1.1/title: OpenAIRE
https://schema.org/license: https://creativecommons.org/licenses/by/4.0/
http://www.w3.org/ns/dcat#dataset: https://example.org/dataset/ai
https://schema.org/dateCreated: 2026-

#### Finally, we can use the ECCCH API to publish the data generated.

In [1]:
## To Be done!
# Integration with the ECCCH once the API and fina data model is available

### Publication & dissemination

This step involves the publication of the results obtained including the dataset and this notebook in different platforms such as the Social Sciences and Humanities Open Marketplace and Zenodo.


As an example, we will use the sandbox service of Zenodo. Note that if you want to use this code for production purposes, it is required to update the URL. First, we need to create an access token in this [link](https://zenodo.org/account/settings/applications/tokens/new/). Note that we also need a token for the sandbox Zenodo.

In [43]:
# https://developers.zenodo.org/
import requests
ACCESS_TOKEN = 'ChangeMe'

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {ACCESS_TOKEN}"
}
r = requests.post('https://sandbox.zenodo.org/api/deposit/depositions',
                   json={},
                   headers=headers)
r.status_code
r.json()

{'created': '2026-01-12T17:41:07.632652+00:00',
 'modified': '2026-01-12T17:41:07.740296+00:00',
 'id': 424692,
 'conceptrecid': '424691',
 'metadata': {'access_right': 'open',
  'prereserve_doi': {'doi': '10.5281/zenodo.424692', 'recid': 424692}},
 'title': '',
 'links': {'self': 'https://sandbox.zenodo.org/api/deposit/depositions/424692',
  'html': 'https://sandbox.zenodo.org/deposit/424692',
  'badge': 'https://sandbox.zenodo.org/badge/doi/.svg',
  'files': 'https://sandbox.zenodo.org/api/deposit/depositions/424692/files',
  'bucket': 'https://sandbox.zenodo.org/api/files/a40e829f-dd24-4e7b-b228-7c4e048a270c',
  'latest_draft': 'https://sandbox.zenodo.org/api/deposit/depositions/424692',
  'latest_draft_html': 'https://sandbox.zenodo.org/deposit/424692',
  'publish': 'https://sandbox.zenodo.org/api/deposit/depositions/424692/actions/publish',
  'edit': 'https://sandbox.zenodo.org/api/deposit/depositions/424692/actions/edit',
  'discard': 'https://sandbox.zenodo.org/api/deposit/depos

Now, let’s upload a new file:

In [44]:
bucket_url = r.json()["links"]["bucket"]
deposition_id = r.json()["id"]

First, we create a zip file with the notebook and the requirements.txt file:

In [45]:
from zipfile import ZipFile

# List of files to include in the archive
file_list = ["AI-Preservation.ipynb", "requirements.txt"]

# Create ZIP file and write files into it
with ZipFile("output.zip", "w") as zipf:
   for file in file_list:
      zipf.write(file)

Then, we call the API:

In [47]:
filename = "output.zip"
path = "%s" % filename
headers = {'Authorization': f'Bearer {ACCESS_TOKEN}'}

''' 
The target URL is a combination of the bucket link with the desired filename
seperated by a slash.
'''
with open(path, "rb") as fp:
    r = requests.put(
        "%s/%s" % (bucket_url, filename),
        data=fp,
        headers=headers,
    )
r.json()

{'created': '2026-01-12T17:41:16.912746+00:00',
 'updated': '2026-01-12T17:41:17.052565+00:00',
 'version_id': 'f8cd6099-25aa-4105-aa41-91921a295466',
 'key': 'output.zip',
 'size': 28194,
 'mimetype': 'application/zip',
 'checksum': 'md5:4df29214004c115c3e7a0ab4a433da97',
 'is_head': True,
 'delete_marker': False,
 'links': {'self': 'https://sandbox.zenodo.org/api/files/a40e829f-dd24-4e7b-b228-7c4e048a270c/output.zip',
  'version': 'https://sandbox.zenodo.org/api/files/a40e829f-dd24-4e7b-b228-7c4e048a270c/output.zip?version_id=f8cd6099-25aa-4105-aa41-91921a295466',
  'uploads': 'https://sandbox.zenodo.org/api/files/a40e829f-dd24-4e7b-b228-7c4e048a270c/output.zip?uploads=1'}}

We can also add metadata to the record:

In [49]:
data = {
     'metadata': {
         'title': 'AI Preservation',
         'upload_type': 'software',
         'description': 'This use case will explore how to approach the long-term preservation of AI models which are contributing to notebooks',
         'creators': [{'name': 'Candela, Gustavo',
                       'affiliation': 'University of Alicante'}]
     }
 }
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {ACCESS_TOKEN}'
}
r = requests.put('https://sandbox.zenodo.org/api/deposit/depositions/%s' % deposition_id,
                  data=json.dumps(data),
                  headers=headers)
r.status_code

200

The last step is the publication:

In [50]:
headers = {'Authorization': f'Bearer {ACCESS_TOKEN}'}
r = requests.post('https://sandbox.zenodo.org/api/deposit/depositions/%s/actions/publish' % deposition_id,
                      headers=headers)
r.status_code
# 202

202

And now we can see the result in Zenodo:

<img src="imgs/zenodo-publication.png" width="70%">

We can reproduce the same with additional platforms such as [Wikidata](https://www.wikidata.org/) and the [Social Sciences and Humanities Open Marketplace](https://marketplace.sshopencloud.eu/about/api-documentation)

In the particular case of Wikidata, existing [python libraries](https://www.mediawiki.org/wiki/Manual:Pywikibot/Wikidata) can be used to extract and create entities.

### References

- Candela, G., Rosiński, C., & Margraf, A. (2025). A reproducible framework to publish and reuse Collections as data: the case of the European Literary Bibliography (Version 4, Vol. 965, Issue 170). Transformations: A DARIAH Journal . https://doi.org/10.46298/transformations.14729
- Gustavo Candela, Javier Pereda, Dolores Sáez, Pilar Escobar, Alexander Sánchez, Andrés Villa Torres, Albert A. Palacios, Kelly McDonough, and Patricia Murrieta-Flores. 2023. An Ontological Approach for Unlocking the Colonial Archive. J. Comput. Cult. Herit. 16, 4, Article 74 (December 2023), 18 pages. https://doi.org/10.1145/3594727
- Jiarui Liu, Wenkai Li, Zhijing Jin, and Mona Diab. 2024. Automatic Generation of Model and Data Cards: A Step Towards Responsible AI. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1975–1997, Mexico City, Mexico. Association for Computational Linguistics.
- https://developers.zenodo.org/#quickstart-upload
- https://www.echoes-eccch.eu/wp-content/uploads/2025/06/ECHOES_HDT_Ontology.pdf
- https://marketplace.sshopencloud.eu/about/api-documentation
- https://www.wikidata.org/
- https://cidoc-crm.org/sites/default/files/CRMdigv4.0.pdf
- https://www.w3.org/TR/prov-o/