# Transforming the Moving Image Archive dataset to RDF

Created in October-December 2022 for the National Library of Scotland's Data Foundry by [Gustavo Candela, National Librarian’s Research Fellowship in Digital Scholarship 2022-23](https://data.nls.uk/projects/the-national-librarians-research-fellowship-in-digital-scholarship-2022-23/)

### About the Moving Image Archive Dataset

This dataset represents the descriptive metadata from the Moving Image Archive catalogue, which is Scotland’s national collection of moving images.

- Data format: metadata available as MARCXML and Dublin Core
- Data source: https://data.nls.uk/data/metadata-collections/moving-image-archive/

### Table of contents

- [Preparation](#Preparation)
- [Transformation to RDF](#Transformation-to-RDF)

### Citations

- Candela, G., Sáez, M. D., Escobar, P., & Marco-Such, M. (2022). Reusing digital collections from GLAM institutions. Journal of Information Science, 48(2), 251–267. https://doi.org/10.1177/0165551520950246

### Preparation

Import the libraries required to read the CSV records and build the RDF graph.

In [1]:
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import FOAF, RDF, DCTERMS, VOID, DC, SKOS
import pandas as pd

### Transformation to RDF

*Note: the `domain` variable can be updated to the organisation's own domain (e.g., https://data.nls.uk/). Keep the trailing slash, because resource URIs are built by appending a path to this value.*

In [2]:
domain = 'https://example.org/'

#### First, we instantiate all the namespaces that we will use when defining the RDF data

In [3]:
g = Graph()
g.bind("foaf", FOAF)
g.bind("rdf", RDF)
g.bind("dcterms", DCTERMS)
g.bind("dc", DC)
g.bind("void", VOID)
g.bind("skos", SKOS)

schema = Namespace("https://schema.org/")
g.bind("schema", schema)

edm = Namespace("http://www.europeana.eu/schemas/edm/")
g.bind("edm", edm)

#### We define the resource National Library of Scotland

In [4]:
nls = URIRef(domain + "organisation/nls")
g.add((nls, RDF.type, schema.Organization))
g.add((nls, schema.url, URIRef("https://www.nls.uk/")))
g.add((nls, schema.logo, URIRef("https://www.nls.uk/images/nls-logo.png")))
g.add((nls, schema.name, Literal("National Library of Scotland")))
g.add((nls, DC.title, Literal("National Library of Scotland")))

<Graph identifier=N24baaa86af1b4d2199fbfdcc3b567d05 (<class 'rdflib.graph.Graph'>)>

#### Let's transform the records provided by the CSV file into RDF

In [5]:

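# Column names are supplied explicitly, so the header line of the CSV is read as
# row 0 (visible in the output below) and is skipped later when iterating.
df = pd.read_csv('../data/output/movingImageArchive.csv',
                 names=('title', 'author', 'authorOrganisation', 'place_publication',
                        'date', 'extent', 'credits',
                        'subjects', 'summary', 'details', 'link', 'geographicNames',
                        'contentType', 'mediaType', 'carrierType', 'generalNote', 'thumbnail'))
print(df)
df = df.reset_index()  # make sure the index runs from 0 to the number of rows - 1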

                                   title                        author  \
0                                  title                        author   
1      GLASGOW TRAMS AND BOTANIC GARDENS  RUSSELL, Stanley Livingstone   
2         LAST DAY OF THE TRAMS, GLASGOW                           NaN   
3                         INTO THE MISTS                           NaN   
4            PASSING OF THE TRAMCAR, the                           NaN   
...                                  ...                           ...   
20604      N.P. NASSAU BAY  Ship No. 689                           NaN   
20605         DREDGING IN THE RIVER TEES                           NaN   
20606     AUTOMATION ON A SUCTION DREDGE                           NaN   
20607      QUEEN ELIZABETH  Ship No. 552                           NaN   
20608                            RUAHINE                           NaN   

       authorOrganisation                     place_publication  date  \
0      authorOrganisation             
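
Row 0 of the output above is the CSV header line, which pandas loads as data because `names` is passed without `header`. The sketch below, run on a small in-memory sample, shows how `header=0` would replace the existing header instead. The sample rows and their `https://example.org/` links are assumptions for illustration, not records taken from the dataset file.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for movingImageArchive.csv
sample = io.StringIO(
    "title,author,link\n"
    'GLASGOW TRAMS AND BOTANIC GARDENS,"RUSSELL, Stanley Livingstone",https://example.org/film/1\n'
    "INTO THE MISTS,,https://example.org/film/2\n"
)
columns = ("title", "author", "link")

# names only: the header line becomes the first data row
with_header_row = pd.read_csv(sample, names=columns)

# names plus header=0: the header line is consumed and replaced by names
sample.seek(0)
without_header_row = pd.read_csv(sample, names=columns, header=0)

print(len(with_header_row), len(without_header_row))
```

If the notebook adopted `header=0`, the check on `index != 0` in the loop would no longer be needed and would have to be dropped, otherwise the first real record would be skipped.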

In [6]:
for index, row in df.iterrows():
    # Row 0 holds the CSV header line; a record without a link cannot be given a URI
    if index != 0 and pd.notnull(row["link"]):
        video = URIRef(domain + row["link"].replace("http://movingimage.nls.uk/","").replace(" ","").strip())
        g.add((video, RDF.type, schema.VideoObject))
        g.add((video, schema.sourceOrganization, nls))
        
        if pd.notnull(row["title"]):
            g.add((video, DC.title, Literal(row["title"].strip())))
            g.add((video, schema.name, Literal(row["title"].strip())))
        
        if pd.notnull(row["extent"]):
            g.add((video, schema.duration, Literal(row["extent"].strip())))
            
        if pd.notnull(row["thumbnail"]):
            g.add((video, schema.thumbnail, URIRef(row["thumbnail"].strip())))
        
        if pd.notnull(row["credits"]):
            g.add((video, schema.creditText, Literal(row["credits"].strip())))
            
        if pd.notnull(row["summary"]):
            g.add((video, schema.abstract, Literal(row["summary"].strip())))
            
        if pd.notnull(row["details"]):
            g.add((video, schema.videoQuality, Literal(row["details"].strip())))
            
        if pd.notnull(row["date"]):
            g.add((video, schema.datePublished, Literal(row["date"].strip())))
            g.add((video, DC.date, Literal(row["date"].strip())))
            
        if pd.notnull(row["link"]):
            g.add((video, schema.identifier, URIRef(row["link"].replace(" - du", "").strip())))
            g.add((video, DC.identifier, URIRef(row["link"].replace(" - du", "").strip())))
        
        if pd.notnull(row["subjects"]):
            subjects = row["subjects"].split("--")
            for r in subjects:
                g.add((video, DC.subject, Literal(r.strip())))
                
        if pd.notnull(row["geographicNames"]):
            geographicNames = row["geographicNames"].split("--")
            for r in geographicNames:
                # Build the URI slug separately so the labels keep their spaces and commas
                slug = r.replace(",", "").replace(" ", "").lower().strip()
                place = URIRef(domain + 'location/' + slug)
                
                g.add((video, DCTERMS.spatial, place))
                g.add((place, RDF.type, schema.Place))
                g.add((place, RDF.type, edm.Place))
                g.add((place, SKOS.prefLabel, Literal(r.strip())))
                g.add((place, schema.name, Literal(r.strip())))
                
                
        if pd.notnull(row["author"]):
            authors = row["author"].split("--")
            for r in authors:
                authorText = r
               
                if "/" in authorText:
                    authorText = authorText[0:authorText.index("/")]
                authorText = authorText.lower().strip()
                authorText = authorText.replace("’", "")
                authorText = authorText.replace(".", "")
                authorText = authorText.replace("(", "")
                authorText = authorText.replace(")", "")
                authorText = authorText.replace(",", "")
                authorText = authorText.replace("‘", "")
                authorText = authorText.replace(" ", "")
                author = URIRef(domain + 'author/' + authorText)
                
                g.add((video, schema.author, author))
                g.add((author, RDF.type, schema.Person))
                g.add((author, RDF.type, FOAF.Person))
                g.add((author, SKOS.prefLabel, Literal(r.strip())))
                g.add((author, schema.name, Literal(r.strip())))
                g.add((author, FOAF.name, Literal(r.strip())))

        if pd.notnull(row["authorOrganisation"]) :
            authors = row["authorOrganisation"].split("--")
            for r in authors:
                authorText = r
               
                if "/" in authorText:
                    authorText = authorText[0:authorText.index("/")]

                authorText = authorText.replace("’", "")
                authorText = authorText.replace(".", "")
                authorText = authorText.replace("(", "")
                authorText = authorText.replace(")", "")
                authorText = authorText.replace(",", "")
                authorText = authorText.replace("‘", "")
                authorText = authorText.replace(" ", "")
                author = URIRef(domain + 'organization/' + authorText.lower())
                
                g.add((video, schema.author, author))
                g.add((author, RDF.type, schema.Organization))
                g.add((author, RDF.type, FOAF.Organization))
                g.add((author, SKOS.prefLabel, Literal(r.strip())))
                g.add((author, schema.name, Literal(r.strip())))
                g.add((author, FOAF.name, Literal(r.strip())))
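
The clean-up steps repeated for authors and organisations above could be collected in a single function. The following is a sketch under that assumption. The name `make_slug` is hypothetical, and the function is meant to mirror the replacements used in the loop rather than change them.

```python
def make_slug(text: str) -> str:
    """Hypothetical helper mirroring the clean-up applied to author names."""
    # Keep only the part before the first slash, as the loop does
    if "/" in text:
        text = text[:text.index("/")]
    # Drop the punctuation and spaces removed in the loop
    for ch in ("’", "‘", ".", "(", ")", ",", " "):
        text = text.replace(ch, "")
    return text.lower().strip()


slug = make_slug("RUSSELL, Stanley Livingstone")
print(slug)  # russellstanleylivingstone
```

A shared helper keeps the person and organisation URIs consistent if the list of removed characters ever changes.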

In [7]:
g.serialize(destination="../rdf/dataset.ttl")

<Graph identifier=N24baaa86af1b4d2199fbfdcc3b567d05 (<class 'rdflib.graph.Graph'>)>