# Parsing RDFa in ISAW's Born Digital Publications

Right now this notebook demonstrates loading and some simple parsing of the RDFa in ISAW's born digital publications. It has the side effect of confirming that the data in those publications is usable, and it begins to explore whether that data is useful.

After 'Setup', titles, authors, and referenced resources are listed.

## Setup

The end result of the following cells is a Graph object holding the triples represented by the RDFa encoded in our various publications. Books haven't been added yet. Coming soon...

In [1]:
import os
import urllib.request


from bs4 import BeautifulSoup

import rdflib

import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

INFO:rdflib:RDFLib Version: 4.2.1


In [2]:
# namespaces for rdflib
ns = {"dcterms" : "http://purl.org/dc/terms/",
      "foaf"    : "http://xmlns.com/foaf/0.1/",
      "owl"     : "http://www.w3.org/2002/07/owl#",
      "rdf"     : "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      "rdfs"    : "http://www.w3.org/2000/01/rdf-schema#" }

In [3]:
# UGLY(!) but fast for now...
url_base = 'file:///Users/sfsh/Documents/isaw-papers-awdl/'
# url_base = 'http://isawnyu.github.io/isaw-papers-awdl/' # development
# url_base = 'http://dlib.nyu.edu/awdl/isaw/isaw-papers/' # fully published

# I do need to add in all the ISAW Papers 7 links. They're coming...
urls_to_load = [{'url':url_base + '1/',  'fn' : 'isaw-papers-1.xhtml'},
                {'url':url_base + '2/',  'fn' : 'isaw-papers-2.xhtml'},
                {'url':url_base + '3/',  'fn' : 'isaw-papers-3.xhtml'},
                {'url':url_base + '4/',  'fn' : 'isaw-papers-4.xhtml'},
                {'url':url_base + '5/',  'fn' : 'isaw-papers-5.xhtml'},
                {'url':url_base + '6/',  'fn' : 'isaw-papers-6.xhtml'},
                {'url':url_base + '7/',  'fn' : 'isaw-papers-7.xhtml'},
                {'url':url_base + '7/elliott-heath-muccigrosso/',  'fn' : 'isaw-papers-7-elliott-heath-muccigrosso.xhtml'},
                {'url':url_base + '7/acheson/',  'fn' : 'isaw-papers-7-acheson.xhtml'},
                {'url':url_base + '7/almas-babeu-krohn/',  'fn' : 'isaw-papers-7-almas-babeu-krohn.xhtml'},
                {'url':url_base + '7/benefiel-sprenkle/',  'fn' : 'isaw-papers-7-benefiel-sprenkle.xhtml'},
                {'url':url_base + '7/blackwell-smith/',  'fn' : 'isaw-papers-7-blackwell-smith.xhtml'},
                {'url':url_base + '7/elliott-jones/', 'fn' : 'isaw-papers-7-elliott-jones.xhtml'},
                {'url':url_base + '7/hafford/',  'fn' : 'isaw-papers-7-hafford.xhtml'},
                {'url':url_base + '7/heath/',  'fn' : 'isaw-papers-7-heath.xhtml'},
                {'url':url_base + '7/horne/',  'fn' : 'isaw-papers-7-horne.xhtml'},
                {'url':url_base + '7/kansa/',  'fn' : 'isaw-papers-7-kansa.xhtml'},
                {'url':url_base + '7/lana/',  'fn' : 'isaw-papers-7-lana.xhtml'},
                {'url':url_base + '7/liuzzo/',  'fn' : 'isaw-papers-7-liuzzo.xhtml'},
                {'url':url_base + '7/mackay/',  'fn' : 'isaw-papers-7-mackay.xhtml'},
                {'url':url_base + '7/mcmichael/',  'fn' : 'isaw-papers-7-mcmichael.xhtml'},
                {'url':url_base + '7/meadows-gruber/',  'fn' : 'isaw-papers-7-meadows-gruber.xhtml'},
                {'url':url_base + '7/meyers/',  'fn' : 'isaw-papers-7-meyers.xhtml'},
                {'url':url_base + '7/murray/',  'fn' : 'isaw-papers-7-murray.xhtml'},
                {'url':url_base + '7/nurmikko-fuller/',  'fn' : 'isaw-papers-7-nurmikko-fuller.xhtml'},
                {'url':url_base + '7/pearce-schmitz/', 'fn' : 'isaw-papers-7-pearce-schmitz.xhtml'},
                {'url':url_base + '7/pett/', 'fn' : 'isaw-papers-7-pett.xhtml'},
                {'url':url_base + '7/poehler/', 'fn' : 'isaw-papers-7-poehler.xhtml'},
                {'url':url_base + '7/rabinowitz/', 'fn' : 'isaw-papers-7-rabinowitz.xhtml'},
                {'url':url_base + '7/reinhard/', 'fn' : 'isaw-papers-7-reinhard.xhtml'},
                {'url':url_base + '7/romanello/', 'fn' : 'isaw-papers-7-romanello.xhtml'},
                {'url':url_base + '7/roueche-lawrence-lawrence/', 'fn' : 'isaw-papers-7-roueche-lawrence-lawrence.xhtml'},
                {'url':url_base + '7/seifried/', 'fn' : 'isaw-papers-7-seifried.xhtml'},
                {'url':url_base + '7/simon-barker-desoto-isaksen/', 'fn' : 'isaw-papers-7-simon-barker-desoto-isaksen.xhtml'},
                {'url':url_base + '7/taylor/', 'fn' : 'isaw-papers-7-taylor.xhtml'},
                {'url':url_base + '7/tsonev/', 'fn' : 'isaw-papers-7-tsonev.xhtml'},
                {'url':url_base + '7/vankeer/', 'fn' : 'isaw-papers-7-vankeer.xhtml'},
                {'url':url_base + '8/',  'fn' : 'isaw-papers-8.xhtml'},
                {'url':url_base + '9/',  'fn' : 'isaw-papers-9.xhtml'},
                {'url':url_base + '10/', 'fn' : 'isaw-papers-10.xhtml'},
                {'url':url_base + '11/', 'fn' : 'isaw-papers-11.xhtml'}]
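
The hand-written list above follows one pattern: the directory under `url_base` determines the local filename. A minimal sketch of generating the same dictionaries from short slugs is below. The helper name `build_entries` and the slug list are my own illustration, not part of the notebook.

```python
def build_entries(url_base, slugs):
    """Return [{'url': ..., 'fn': ...}] for slugs such as '1' or '7/acheson'."""
    entries = []
    for slug in slugs:
        entries.append({
            # each paper lives in its own directory under url_base
            'url': url_base + slug + '/',
            # '7/acheson' becomes 'isaw-papers-7-acheson.xhtml'
            'fn': 'isaw-papers-' + slug.replace('/', '-') + '.xhtml',
        })
    return entries


# Illustrative subset only; the full slug list would mirror urls_to_load.
example = build_entries('http://isawnyu.github.io/isaw-papers-awdl/',
                        ['1', '7', '7/acheson', '10'])
```

Adding a new paper then means adding one slug rather than one full dictionary.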
    

In [4]:
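# put everything into a local directory so that connectivity issues
# don't interfere with debugging the next cell

# make sure the target directory exists before downloading into it
os.makedirs('docs', exist_ok=True)

for url_fn in urls_to_load:
    url = url_fn['url']
    # file: URLs have no web server to resolve a directory to its index page
    if url.startswith('file:'):
        url = url + 'index.xhtml'
    urllib.request.urlretrieve(url, 'docs/' + url_fn['fn'])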
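
Since the point of the local copy is to avoid depending on connectivity, it may also help to skip files that are already present. The following is only a sketch of that idea. The function name `fetch_all` and its `dest_dir` and `overwrite` parameters are assumptions of mine, not something the notebook defines.

```python
import os
import urllib.request


def fetch_all(entries, dest_dir='docs', overwrite=False):
    """Download each entry into dest_dir and return the paths actually fetched."""
    os.makedirs(dest_dir, exist_ok=True)
    fetched = []
    for entry in entries:
        url = entry['url']
        # local file: URLs need the index page named explicitly
        if url.startswith('file:'):
            url += 'index.xhtml'
        target = os.path.join(dest_dir, entry['fn'])
        # leave existing copies alone unless a refresh is requested
        if overwrite or not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
            fetched.append(target)
    return fetched
```

Calling `fetch_all(urls_to_load)` a second time would then download nothing, and `fetch_all(urls_to_load, overwrite=True)` would refresh every file.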

In [5]:
# ISAW's digital publications use RDFa, so the RDF they carry can be loaded into a graph...
g = rdflib.Graph()

for url_fn in urls_to_load:
    g.load('docs/' + url_fn['fn'],format="rdfa")

len(g) # this is the total number of triples in the graph

4947

## List Titles

In [6]:
result = g.query('''SELECT ?title ?s WHERE {
    OPTIONAL { ?s dcterms:title ?title . }
    ?s dcterms:isPartOf* <http://isaw.nyu.edu/publications/isaw-papers> .  
} ORDER BY ?s''', initNs = ns)

# we want to find this many titles:
num_titles = 41

if len(result) == num_titles:
    print("%s Titles found:" % len(result))
else:
    print("%s titles found, which is not the right number. Should be %s." % (len(result), num_titles))

for row in result:
    print(" " + row.title)
    
# titles won't print in strict numeric order because the paper numbers in the URLs aren't zero-padded.

41 Titles found:
 A New Discovery of a Component of Greek Astrology in Babylonian Tablets: The “Terms”
 Preliminary Report on Early Byzantine Pottery from a Building Complex at Kenchreai (Greece)
 The Moon Phase Anomaly in the Antikythera Mechanism
 Review of Ptolemaic Numismatics, 1996 to 2007
 Rome and the Economic Integration of Empire
 The Cosmos in the Antikythera Mechanism
 A Syriac Fragment from The Cause of All Causes on the Pillars of Hercules
 The Quartier du Stade on late Hellenistic Delos: a case study of rapid urbanization (fieldwork seasons 2009-2010)
 Linked Open Bibliographies in Ancient Studies
 Linked Data in the Perseus Digital Library
 The Herculaneum Graffiti Project
 The Homer Multitext and RDF-Based Integration
 Prologue and Introduction
 Moving the Ancient World Online Forward
 Linked Open Data and the Ur of the Chaldees Project
 ISAW Papers: Towards a Journal as Linked Open Data
 Beyond Maps as Images at the Ancient World Mapping Center
 Open Context and Linked

## List Authors

Not sure why the DISTINCT keyword isn't working.

In [7]:

sresult = g.query('''SELECT DISTINCT ?name WHERE {
 
    ?s dcterms:creator/foaf:name* ?name .
    ?s dcterms:isPartOf* <http://isaw.nyu.edu/publications/isaw-papers>

  FILTER isLiteral(?name)
} ORDER BY ?name''', initNs = ns)

for row in sresult:
    print(row.name)
    
# hmmm... why isn't ?name distinct?

A. L. McMichael
Adam Rabinowitz
Alison Babeu
Andrew Meadows
Andrew Reinhard
Anna Krohn
Bridget Almas
Camilla MacKay
Charlotte Roueché
Christian Miks
Christián C. Carman
Christopher W. Blackwell
Chuck Jones
D. Neel Smith
Ellen Van Keer
Elton Barker
Eric C. Kansa
Ethan Gruber
Federico De Romanis
Gavin Blasdel
John Muccigrosso
Jon Taylor
Jorge J. Bravo III
Joseph L. Rife
K. Faith Lawrence
Katy M. Meyers
Keith Lawrence
Laurie Pearce
Leif Isaksen
Marcelo Di Cocco
Matteo Romanello
Maurizio Lana
Paola Davoli
Patrick Schmitz
Pau de Soto
Phoebe Acheson
Pietro Maria Liuzzo
Rainer Simon
Rebecca Benefiel
Rebecca M. Seifried
Ryan Horne
Sara Sprenkle
Sebastian Heath
Terhi Nurmikko-Fuller
Tom Elliott
Tsoni Tsonev
William B. Hafford
William Murray
Adam C. McCollum
Alexander Jones
Andrew Meadows
Catharine Lorber
Daniel E.J. Pett
Gilles Bransbourg
John M. Steele
John Muccigrosso
Mantha Zarmakoupi
Sebastian Heath
Thomas Elliott
Tony Freeth
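
A possible explanation for the repeats, which I have not checked against the data: SPARQL's DISTINCT compares RDF terms, so two literals with the same text but a different language tag or datatype count as different values. Stray whitespace would have the same effect. Whatever the cause, the printed list can be de-duplicated on the text alone. The helper `unique_names` below is a sketch of mine, and the sample values are made up to show the behavior.

```python
import unicodedata


def unique_names(values):
    """Return the distinct names by text, in first-seen order."""
    seen = set()
    names = []
    for value in values:
        # str() drops any language tag or datatype an RDF literal carries
        text = unicodedata.normalize('NFC', str(value))
        # collapse runs of whitespace, including non-breaking spaces
        text = ' '.join(text.split())
        if text and text not in seen:
            seen.add(text)
            names.append(text)
    return names


# Made-up sample with a trailing space and a non-breaking space.
sample = ['Andrew Meadows', 'Sebastian Heath',
          'Andrew Meadows ', 'Sebastian\u00a0Heath']
deduplicated = unique_names(sample)
```

Applied to the query above, this would be `sorted(unique_names(row.name for row in sresult))`.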


## List Referenced Resources

Scroll down for Pleiades URIs: the results are sorted by URI, and the Pleiades ones begin with 'https', so they come after the 'http' URIs. The results show that some of the RDFa markup needs work.
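
Rather than scrolling, the label and URI pairs could be grouped by host name. This is a rough sketch using only the standard library. `group_by_host` is a name I made up, and the Pleiades row in the sample is illustrative rather than taken from the query output.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_host(pairs):
    """Group (label, uri) pairs by the host part of the URI."""
    groups = defaultdict(list)
    for label, uri in pairs:
        host = urlparse(str(uri)).netloc
        groups[host].append((label, str(uri)))
    return dict(groups)


sample = [
    ('Pliny the Elder',
     'http://catalog.perseus.org/catalog/urn:cite:perseus:author.1141'),
    ('BM 36326', 'http://collection.britishmuseum.org/id/object/WCT63181'),
    # illustrative Pleiades entry, not copied from the results
    ('Roma', 'https://pleiades.stoa.org/places/423025'),
]
groups = group_by_host(sample)
```

With the real results, `group_by_host((row.label, row.uri) for row in result)` should put everything from `pleiades.stoa.org` under a single key.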

In [8]:
# rdflib is throwing an error when I use property paths.
result = g.query('''SELECT ?uri ?label WHERE {
    ?s dcterms:references ?b1 .
    ?b1 rdfs:isDefinedBy ?uri .
    OPTIONAL {?b1 rdfs:label ?label .}
    OPTIONAL {?b1 dcterms:bibliographicCitation ?label .}
} ORDER BY ?uri''', initNs = ns)

for row in result:
    print("%s %s" % (row.label, row.uri))

Pliny the Elder http://catalog.perseus.org/catalog/urn:cite:perseus:author.1141
Patton, Stacy.  “The Dissertation Can No Longer Be Defended.” The Chronicle of Higher Education 11 February 2013. http://chronicle.com/article/The-Dissertation-Can-No-Longer/137215/
BM 36628+36817+37197 http://collection.britishmuseum.org/id/object/WCT137352
BM 33066 (formerly Strassmaier Cambyses 400, a Neo-Babylonian astronomical text whose arrangement Kugler imperfectly understood; description available on BM website) http://collection.britishmuseum.org/id/object/WCT191051
K.2067 (formerly III R 57, a Neo-Assyrian duplicate or relative to the "Great Star List"; description available on BM website) http://collection.britishmuseum.org/id/object/WCT3464
BM 36326 http://collection.britishmuseum.org/id/object/WCT63181
K.4386 (formerly II R 48, a Neo-Assyrian copy of the lexical text ur5-ra = hubullu; description available on BM website) http://collection.britishmuseum.org/id/object/WCT6693
Vegetius (Epitome 2