# Parsing RDFa in ISAW's Born Digital Publications

This notebook demonstrates loading and some simple parsing of the RDFa in ISAW's born digital publications. As a side effect, it confirms that the data in those publications is usable and begins to explore whether it is useful.

After 'Setup', titles, authors, and referenced resources are listed.

## Setup

The end result of the following cells is an rdflib Graph object holding the triples represented by the RDFa encoded in our various publications. Book chapters are included in the URL list below but are not loaded by default.

In [None]:
import json
import os
import pandas as pd
import re
import urllib.request

from bs4 import BeautifulSoup

from IPython.display import HTML

import rdflib

import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)


# None means "do not truncate"; the older -1 value is rejected by recent pandas releases
pd.set_option('display.max_colwidth', None)

In [None]:
# namespaces for rdflib
ns = {"dcterms" : "http://purl.org/dc/terms/",
      "foaf"    : "http://xmlns.com/foaf/0.1/",
      "owl"     : "http://www.w3.org/2002/07/owl#",
      "rdf"     : "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      "rdfs"    : "http://www.w3.org/2000/01/rdf-schema#" }

In [None]:
# url_base = 'file:///Users/sfsh/Documents/isaw-papers-awdl/' # UGLY(!) but fast
url_base_papers = 'http://isawnyu.github.io/isaw-papers-awdl/' # development
url_base_books  = 'http://isawnyu.github.io/isaw-books-awdl/'
# url_base = 'http://dlib.nyu.edu/awdl/isaw/isaw-papers/' # fully published

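# Only ISAW Papers 1 is loaded by default. The remaining papers, the ISAW Papers 7
# articles, and the book chapters are collected in `more`; append it with
# `urls_to_load += more` to load them as well.
urls_to_load = [{'url':url_base_papers + '1/', 'fn' : 'isaw-papers-1.xhtml'}]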

more = [
                {'url':url_base_papers + '2/', 'fn' : 'isaw-papers-2.xhtml'},
                {'url':url_base_papers + '3/', 'fn' : 'isaw-papers-3.xhtml'},
                {'url':url_base_papers + '4/', 'fn' : 'isaw-papers-4.xhtml'},
                {'url':url_base_papers + '5/', 'fn' : 'isaw-papers-5.xhtml'},
                {'url':url_base_papers + '6/', 'fn' : 'isaw-papers-6.xhtml'},
                {'url':url_base_papers + '7/', 'fn' : 'isaw-papers-7.xhtml'},
                {'url':url_base_papers + '7/elliott-heath-muccigrosso/', 'fn' : 'isaw-papers-7-elliott-heath-muccigrosso.xhtml'},
                {'url':url_base_papers + '7/acheson/', 'fn' : 'isaw-papers-7-acheson.xhtml'},
                {'url':url_base_papers + '7/almas-babeu-krohn/', 'fn' : 'isaw-papers-7-almas-babeu-krohn.xhtml'},
                {'url':url_base_papers + '7/benefiel-sprenkle/', 'fn' : 'isaw-papers-7-benefiel-sprenkle.xhtml'},
                {'url':url_base_papers + '7/blackwell-smith/', 'fn' : 'isaw-papers-7-blackwell-smith.xhtml'},
                {'url':url_base_papers + '7/elliott-jones/', 'fn' : 'isaw-papers-7-elliott-jones.xhtml'},
                {'url':url_base_papers + '7/hafford/', 'fn' : 'isaw-papers-7-hafford.xhtml'},
                {'url':url_base_papers + '7/heath/', 'fn' : 'isaw-papers-7-heath.xhtml'},
                {'url':url_base_papers + '7/horne/', 'fn' : 'isaw-papers-7-horne.xhtml'},
                {'url':url_base_papers + '7/kansa/', 'fn' : 'isaw-papers-7-kansa.xhtml'},
                {'url':url_base_papers + '7/lana/', 'fn' : 'isaw-papers-7-lana.xhtml'},
                {'url':url_base_papers + '7/liuzzo/', 'fn' : 'isaw-papers-7-liuzzo.xhtml'},
                {'url':url_base_papers + '7/mackay/', 'fn' : 'isaw-papers-7-mackay.xhtml'},
                {'url':url_base_papers + '7/mcmichael/', 'fn' : 'isaw-papers-7-mcmichael.xhtml'},
                {'url':url_base_papers + '7/meadows-gruber/', 'fn' : 'isaw-papers-7-meadows-gruber.xhtml'},
                {'url':url_base_papers + '7/meyers/', 'fn' : 'isaw-papers-7-meyers.xhtml'},
                {'url':url_base_papers + '7/murray/', 'fn' : 'isaw-papers-7-murray.xhtml'},
                {'url':url_base_papers + '7/nurmikko-fuller/', 'fn' : 'isaw-papers-7-nurmikko-fuller.xhtml'},
                {'url':url_base_papers + '7/pearce-schmitz/', 'fn' : 'isaw-papers-7-pearce-schmitz.xhtml'},
                {'url':url_base_papers + '7/pett/', 'fn' : 'isaw-papers-7-pett.xhtml'},
                {'url':url_base_papers + '7/poehler/', 'fn' : 'isaw-papers-7-poehler.xhtml'},
                {'url':url_base_papers + '7/rabinowitz/', 'fn' : 'isaw-papers-7-rabinowitz.xhtml'},
                {'url':url_base_papers + '7/reinhard/', 'fn' : 'isaw-papers-7-reinhard.xhtml'},
                {'url':url_base_papers + '7/romanello/', 'fn' : 'isaw-papers-7-romanello.xhtml'},
                {'url':url_base_papers + '7/roueche-lawrence-lawrence/', 'fn' : 'isaw-papers-7-roueche-lawrence-lawrence.xhtml'},
                {'url':url_base_papers + '7/seifried/', 'fn' : 'isaw-papers-7-seifried.xhtml'},
                {'url':url_base_papers + '7/simon-barker-desoto-isaksen/', 'fn' : 'isaw-papers-7-simon-barker-desoto-isaksen.xhtml'},
                {'url':url_base_papers + '7/taylor/', 'fn' : 'isaw-papers-7-taylor.xhtml'},
                {'url':url_base_papers + '7/tsonev/', 'fn' : 'isaw-papers-7-tsonev.xhtml'},
                {'url':url_base_papers + '7/vankeer/', 'fn' : 'isaw-papers-7-vankeer.xhtml'},
                {'url':url_base_papers + '8/', 'fn' : 'isaw-papers-8.xhtml'},
                {'url':url_base_papers + '9/', 'fn' : 'isaw-papers-9.xhtml'},
                {'url':url_base_papers + '10/', 'fn' : 'isaw-papers-10.xhtml'},
                {'url':url_base_papers + '11/', 'fn' : 'isaw-papers-11.xhtml'},
                {'url':url_base_books + 'oasis-city-awdl/chapter1.xhtml', 'fn':'oasis-city-ch1.xhtml'},
                {'url':url_base_books + 'oasis-city-awdl/chapter2.xhtml', 'fn':'oasis-city-ch2.xhtml'},
                {'url':url_base_books + 'oasis-city-awdl/chapter3.xhtml', 'fn':'oasis-city-ch3.xhtml'},
                {'url':url_base_books + 'oasis-city-awdl/chapter4.xhtml', 'fn':'oasis-city-ch4.xhtml'},
                {'url':url_base_books + 'oasis-city-awdl/chapter5.xhtml', 'fn':'oasis-city-ch5.xhtml'},
                {'url':url_base_books + 'oasis-city-awdl/chapter6.xhtml', 'fn':'oasis-city-ch6.xhtml'},
                {'url':url_base_books + 'oasis-city-awdl/chapter7.xhtml', 'fn':'oasis-city-ch7.xhtml'},
                {'url':url_base_books + 'hatke2013-aksum-and-nubia-awdl/', 'fn':'hatke2013-aksum-and-nubia.xhtml'}]

print(urls_to_load)
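
The URL list above follows a regular pattern, so it could also be generated rather than typed out. The following is only a sketch of that idea: `paper_entry` is a hypothetical helper, not something the notebook defines, and just a handful of the paths from the list are shown.

```python
url_base_papers = 'http://isawnyu.github.io/isaw-papers-awdl/'

def paper_entry(slug, base=url_base_papers):
    """Return the {'url', 'fn'} dict for a paper path such as '3' or '7/heath'.

    Hypothetical helper: it mirrors the naming pattern used in the list above.
    """
    return {'url': base + slug + '/',
            'fn': 'isaw-papers-' + slug.replace('/', '-') + '.xhtml'}

# A few paths copied from the list above; the rest would follow the same way.
slugs = ['2', '7', '7/acheson', '7/elliott-heath-muccigrosso', '11']
entries = [paper_entry(s) for s in slugs]

for e in entries:
    print(e['url'], '->', e['fn'])
```

The book chapters use a different file naming scheme, so they would need their own helper or would stay hand-written.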

## Build RDF Graph

In [None]:

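g = rdflib.Graph()
for url_fn in urls_to_load:
    # Graph.parse() is the equivalent of Graph.load(), which rdflib 6 and later no longer provide.
    # Note: rdflib 6.0 also dropped its bundled RDFa parser, so this cell needs rdflib 5.x or an RDFa plugin.
    g.parse(url_fn['url'], format="rdfa")

len(g) # this is the total number of triples in the graph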

## List Titles

In [None]:
result = g.query('''SELECT ?title ?s WHERE {
    ?s dcterms:title ?title . 
    ?s dcterms:isPartOf* <http://isaw.nyu.edu/publications/isaw-papers> .  
} ORDER BY ?s''', initNs = ns)

df = pd.DataFrame(result.bindings)
pd.set_option('display.max_rows', len(df))
df

## List Authors

In [None]:

result = g.query('''SELECT ?name WHERE {
 
    ?s dcterms:creator/foaf:name* ?name .
    ?s dcterms:isPartOf <http://isaw.nyu.edu/publications/isaw-papers>

  FILTER isLiteral(?name)
} ORDER BY ?name''', initNs = ns)

df = pd.DataFrame(result.bindings)
pd.set_option('display.max_rows', len(df))
df

## List Referenced Resources

Scroll down for Pleiades URIs: the results are sorted by URI, so the 'https' URIs come after the 'http' ones. The results also show that some of the RDFa markup needs work.

In [None]:
result = g.query('''SELECT DISTINCT ?isawuri ?title ?uri ?label WHERE {
    ?isawuri dcterms:references ?b1ank .
    OPTIONAL {?isawuri dcterms:title ?title .}
    OPTIONAL {?isawpub dcterms:hasPart ?isawuri .
              ?isawpub dcterms:title ?title  . }
    ?b1ank rdfs:isDefinedBy ?uri .
    OPTIONAL {?b1ank rdfs:label ?label .}
    OPTIONAL {?b1ank dcterms:bibliographicCitation ?label .}
} ORDER BY ?uri ?isawpub''', initNs = ns)

df = pd.DataFrame(result.bindings)
pd.set_option('display.max_rows', len(df))
df

In [None]:
s = ''

for row in result:
    s += '<div><a href="%s">%s</a> referenced by <a href="%s">%s</a></div>' % (row.uri, row.label,row.isawuri,row.title)

h = HTML(s)
h
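
Because several parts of the query are OPTIONAL, a row's `label` or `title` can be `None`, and labels may contain characters such as `&` that are special in HTML. One possible way to guard against both is sketched below. `reference_div` is a hypothetical helper, and the sample row is made up rather than taken from the query results.

```python
from html import escape

def reference_div(uri, label, isawuri, title):
    """Build one <div> line; a missing label or title falls back to its URI."""
    label = label if label is not None else uri
    title = title if title is not None else isawuri
    return '<div><a href="%s">%s</a> referenced by <a href="%s">%s</a></div>' % (
        escape(str(uri)), escape(str(label)),
        escape(str(isawuri)), escape(str(title)))

# Made-up sample row standing in for one result of the query above.
sample = [
    ('http://example.org/place', 'A & B', 'http://example.org/paper', None),
]
html_snippet = ''.join(reference_div(*row) for row in sample)
print(html_snippet)
```

In the notebook, the same function could be applied to each `row` from `result` and the joined string passed to `HTML(...)` as before.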

In [None]:
turtle = g.serialize(format="turtle")
if isinstance(turtle, bytes):  # older rdflib releases return bytes here
    turtle = turtle.decode("utf-8")
# the following is only necessary if loading from disk
print(re.sub(r'<file://.*?docs/isaw-papers-([0-9]+)\.xhtml',r'<http://dlib.nyu.edu/awdl/isaw/isaw-papers/\1/',turtle))
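
To see what the substitution above does, here is a small sketch run on a single made-up Turtle line. The local path is hypothetical; only the `docs/isaw-papers-N.xhtml` tail matters to the pattern. The dot before `xhtml` is escaped here so that it matches a literal dot only.

```python
import re

# Hypothetical local file URI, as it might appear when loading from disk.
sample = '<file:///home/user/docs/isaw-papers-10.xhtml> a <http://example.org/Thing> .'

pattern = r'<file://.*?docs/isaw-papers-([0-9]+)\.xhtml'
# \1 carries the paper number over into the published URL.
rewritten = re.sub(pattern, r'<http://dlib.nyu.edu/awdl/isaw/isaw-papers/\1/', sample)
print(rewritten)
```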