# Enriching the Moving Image Archive dataset with Wikidata and GeoNames

Created in October-December 2022 for the National Library of Scotland's Data Foundry by [Gustavo Candela, National Librarian’s Research Fellowship in Digital Scholarship 2022-23](https://data.nls.uk/projects/the-national-librarians-research-fellowship-in-digital-scholarship-2022-23/)

### About the Moving Image Archive Dataset

This dataset represents the descriptive metadata from the Moving Image Archive catalogue, which is Scotland’s national collection of moving images.

- Data format: metadata available as MARCXML and Dublin Core
- Data source: https://data.nls.uk/data/metadata-collections/moving-image-archive/

### Table of contents

- [Preparation](#Preparation)
- [Transformation to RDF](#Transformation-to-RDF)
- [Enriching the dataset](#Enriching-the-dataset)
- [Saving the data](#Saving-the-data)
- [Integration of the data](#Integration-of-the-data)

### Citations

- Candela, G., Sáez, M. D., Escobar, P., & Marco-Such, M. (2022). Reusing digital collections from GLAM institutions. Journal of Information Science, 48(2), 251–267. https://doi.org/10.1177/0165551520950246

### Preparation

Import the libraries required to read the CSV file and enrich the dataset with identifiers from external repositories.

In [15]:
import pandas as pd #for handling csv and csv contents
from rdflib import Graph, Literal, RDF, URIRef, Namespace #basic RDF handling

#### We define the URL pattern for locations

In [16]:
locationPattern = "https://example.org/location/"

### Enriching the dataset

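We read the CSV file containing the location name, latitude, longitude and the external identifiers for Wikidata and GeoNames. The `geonames` column is read as a string so that the identifiers are not converted to floating-point numbers when some rows are empty.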

In [17]:
df = pd.read_csv('../data/movingImageArchive/MovingImageArchiveGeoNames.csv', sep=",", quotechar='"', dtype={'geonames': str})
print(df)

             Location       lat     long   wikidata geonames
0            Aberdeen  57.14369 -2.09814     Q36405  2657832
1       Aberdeenshire  57.16667 -2.66667    Q189912  2657830
2               Angus  56.66667 -2.91667    Q202177  2657306
3         Argyllshire  56.25000 -5.25000    Q652539  2657088
4            Ayrshire  55.50000 -4.50000    Q793283  2656700
5               Banff  57.66477 -2.52964     Q54809  2656402
6        Berwickshire  55.75000 -2.50000    Q786649  2655820
7             Borders       NaN      NaN   Q9177476      NaN
8                Bute  55.83333 -5.10000   Q1147435  2654168
9           Caithness  58.41667 -3.50000    Q864668  2654041
10   Clackmannanshire  56.16667 -3.75000    Q207268  2652975
11      Dumfriesshire  55.16667 -3.50000   Q1247384  2650795
12     Dunbartonshire  56.12639 -4.42069  Q17582129  7280022
13             Dundee  56.46913 -2.97489    Q123709  2650752
14       East Lothian  55.91667 -2.75000    Q207257  2650386
15          Edinburgh  5
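
Some rows, such as Borders, have no coordinates or GeoNames identifier. A quick way to list them before building the graph is sketched below. The inline CSV is a three-row sample copied from the output above rather than the full file, and the variable names are illustrative.

```python
import io

import pandas as pd

# Three rows taken from the printed output above (not the full file)
sample_csv = io.StringIO(
    "Location,lat,long,wikidata,geonames\n"
    "Aberdeen,57.14369,-2.09814,Q36405,2657832\n"
    "Borders,,,Q9177476,\n"
    "Bute,55.83333,-5.10000,Q1147435,2654168\n"
)
sample = pd.read_csv(sample_csv, dtype={"geonames": str})

# Keep the rows where at least one of the enrichment columns is missing
incomplete = sample[sample[["lat", "long", "geonames"]].isna().any(axis=1)]
print(incomplete["Location"].tolist())
# ['Borders']
```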

#### We define the namespaces used to describe the data

In [18]:
g = Graph()

owl = Namespace('http://www.w3.org/2002/07/owl#')
wgs84_pos = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
rdfs = Namespace("http://www.w3.org/2000/01/rdf-schema#")
rdf = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")

g.bind("owl", owl)
g.bind("wgs84_pos", wgs84_pos)
g.bind("rdfs", rdfs)
g.bind("rdf", rdf)

#### We iterate through the rows of the CSV file to transform the information into RDF

In [19]:
for index, row in df.iterrows():

    # Mint the location URI: lower-case the name and remove commas and spaces
    location = URIRef(locationPattern + row['Location'].lower().replace(",", "").replace(" ", ""))

    # Add latitude and longitude when present (stored as plain string literals)
    if not pd.isnull(row['lat']):
        g.add((location, URIRef(wgs84_pos+'lat'), Literal(str(row["lat"]))))
    if not pd.isnull(row['long']):
        g.add((location, URIRef(wgs84_pos+'long'), Literal(str(row["long"]))))

    # Link the location to its Wikidata and GeoNames records when an identifier exists
    if not pd.isnull(row['wikidata']):
        g.add((location, URIRef(owl+'sameAs'), URIRef('https://www.wikidata.org/wiki/' + str(row['wikidata']))))
    if not pd.isnull(row['geonames']):
        g.add((location, URIRef(owl+'sameAs'), URIRef('https://www.geonames.org/' + str(row['geonames']))))

### Saving the data

In [20]:
print(g.serialize(format='turtle'))

g.serialize(destination='../rdf/locations.ttl', format='turtle')

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<https://example.org/location/aberdeen> owl:sameAs <https://www.geonames.org/2657832>,
        <https://www.wikidata.org/wiki/Q36405> ;
    wgs84_pos:lat "57.14369" ;
    wgs84_pos:long "-2.09814" .

<https://example.org/location/aberdeenshire> owl:sameAs <https://www.geonames.org/2657830>,
        <https://www.wikidata.org/wiki/Q189912> ;
    wgs84_pos:lat "57.16667" ;
    wgs84_pos:long "-2.66667" .

<https://example.org/location/angus> owl:sameAs <https://www.geonames.org/2657306>,
        <https://www.wikidata.org/wiki/Q202177> ;
    wgs84_pos:lat "56.66667" ;
    wgs84_pos:long "-2.91667" .

<https://example.org/location/argyllshire> owl:sameAs <https://www.geonames.org/2657088>,
        <https://www.wikidata.org/wiki/Q652539> ;
    wgs84_pos:lat "56.25" ;
    wgs84_pos:long "-5.25" .

<https://example.org/location/ayrshire> owl:sameAs <https://www.geonames.org/2656700>

<Graph identifier=N18511275484e4e309d0618f510024693 (<class 'rdflib.graph.Graph'>)>

### Integration of the data

We merge the original RDF dataset with the location data created above. Parsing both Turtle files into the same graph combines their triples, and the result is saved as a new enriched file.

In [21]:
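g = Graph()
# Parsing several files into one graph merges their triples
g.parse("../rdf/locations.ttl")
g.parse("../rdf/dataset.ttl")
# Write the combined graph to a new file
g.serialize('../rdf/datasetEnriched.ttl')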

<Graph identifier=Nee0a8ea9d40e4fbaadee906c04251e9e (<class 'rdflib.graph.Graph'>)>