Goal: reach desired course end state via meandering Jupyter notebook narrative

Desired end state: a [JSON Lines](https://jsonlines.org/) file of [JSON-LD](https://json-ld.org/) objects that represents all or a portion of the Nobel Prize dataset, such that a competency question can be answered efficiently with MongoDB.

For the MongoDB part, perhaps use `jq` to filter JSON Lines by `rdf:type` in order to `mongoimport` to the appropriate Mongo collections. Or use Python for this.
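
A minimal Python sketch of that routing step, assuming each line is a JSON-LD object with an `@type` key. The `TYPE_TO_COLLECTION` mapping, the function name `split_by_type`, and the sample records are hypothetical and only illustrate the idea.

```python
import json
from collections import defaultdict

# Hypothetical mapping from a JSON-LD type to a Mongo collection name.
TYPE_TO_COLLECTION = {
    "nobel:Laureate": "laureates",
    "nobel:NobelPrize": "nobel_prizes",
}

def split_by_type(lines, type_to_collection=TYPE_TO_COLLECTION):
    """Group JSON Lines records by the collection their @type maps to."""
    by_collection = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        types = doc.get("@type", [])
        if isinstance(types, str):  # JSON-LD allows a single string or a list
            types = [types]
        for t in types:
            if t in type_to_collection:
                by_collection[type_to_collection[t]].append(doc)
    return dict(by_collection)

# Made-up sample records.
sample = [
    '{"@id": "laureate:82", "@type": ["nobel:Laureate", "foaf:Person"]}',
    '{"@id": "ex:prize1", "@type": "nobel:NobelPrize"}',
]
routed = split_by_type(sample)
```

Each group could then be written back out one document per line and loaded with `mongoimport --collection <name>`.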

# i. Fetch Nobel Prize data as SPARQL JSON response

1. Go to <https://data.nobelprize.org/sparql>.
2. Enter this query:
    ```sparql
    PREFIX nobel: <http://data.nobelprize.org/terms/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
    SELECT ?s ?p ?o WHERE {
      ?s ?p ?o .
    }
    ```
3. Select the "Response" results view (the default may be "Table").
4. Click "Download result".
5. `gzip` the result (~30x compression, from ~30MB to ~1MB).
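
The compression step can also be done from Python, which keeps it inside the notebook. This is a sketch under assumptions: `gzip_file` is a hypothetical helper, and the demo compresses a throwaway file with made-up content rather than the real download.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gzip_file(src, dst=None):
    """Compress src to dst (default: src plus '.gz') and return the destination path."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream, so large files need little memory
    return dst

# Demo on a throwaway file so that nothing in data/ is touched.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "response.json"
    raw.write_text('{"results": {"bindings": []}}' * 1000, encoding="utf-8")
    packed = gzip_file(raw)
    raw_size, packed_size = raw.stat().st_size, packed.stat().st_size
    with gzip.open(packed, "rt", encoding="utf-8") as f:
        round_trip = f.read()
```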

In [1]:
!du -h data/00-raw-sparql-response.json.gz

1.2M	data/00-raw-sparql-response.json.gz


# ii. Load JSON response as list of statements and serialize as RDF

1. Load the response into memory as a Python dict.
2. Map the bindings to a list of statements.
3. Load the statements into RDFLib and save them as RDF.
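
For orientation, here is the shape these steps rely on, shown on a toy response. The structure follows the SPARQL JSON results layout (`results` → `bindings` → one dict per variable); the IRIs and the literal are made up, and the `toy_` names are hypothetical.

```python
# Toy response in the SPARQL JSON results shape (values are made up).
toy_response = {
    "head": {"vars": ["s", "p", "o"]},
    "results": {
        "bindings": [
            {
                "s": {"type": "uri", "value": "http://example.org/laureate/1"},
                "p": {"type": "uri", "value": "http://xmlns.com/foaf/0.1/familyName"},
                "o": {"type": "literal", "value": "Doe", "xml:lang": "en"},
            },
        ]
    },
}

# Each binding becomes one (subject, predicate, object) statement of term dicts.
toy_statements = [
    (b["s"], b["p"], b["o"]) for b in toy_response["results"]["bindings"]
]
toy_s, toy_p, toy_o = toy_statements[0]
```

Every term carries a `type` (`uri`, `bnode`, or `literal`) and a `value`; literals may also carry `datatype` or `xml:lang`.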

In [2]:
import gzip
import json

with gzip.open("data/00-raw-sparql-response.json.gz") as f:
    response = json.loads(f.read())

In [3]:
statements = []
for binding in response["results"]["bindings"]:
    statements.append((binding["s"], binding["p"], binding["o"]))

In [4]:
from rdflib import Graph, URIRef, Literal, BNode, Namespace

g = Graph()
for statement in statements:
    s, p, o = statement
    if s["type"] not in ("uri", "bnode"):
        raise ValueError("subs must be uris or bnodes")
    if p["type"] != "uri":
        raise ValueError("preds must be uris")
    if o["type"] not in ("uri", "bnode", "literal"):
        raise ValueError("objs must be uris or bnodes or literals")
    if o["type"] == "literal" and len(set(o) - {"type", "value", "datatype", "xml:lang"}):
        raise ValueError("literal objs can only have datatype and xml:lang apart from value")
        
    s = URIRef(s["value"]) if s["type"] == "uri" else BNode(s["value"])
    p = URIRef(p["value"])
    if o["type"] == "uri":
        o = URIRef(o["value"])
    elif o["type"] == "bnode":
        o = BNode(o["value"])
    else:  # o["type"] == "literal"
        o = Literal(o["value"], lang=o.get("xml:lang"), datatype=o.get("datatype"))
    
    g.add((s, p, o))

In [5]:
g.serialize("data/01-nobelprize-data.nt", format="nt")
!gzip -f data/01-nobelprize-data.nt
!du -h data/01-nobelprize-data.nt.gz

784K	data/01-nobelprize-data.nt.gz


# 1. Representing facts: RDF

Peek at the serialized triples, then load them into an RDF graph using rdflib.

In [6]:
import gzip

from toolz import take

with gzip.open("data/01-nobelprize-data.nt.gz", "rt") as f:
    for line in take(100, f):
        print(line)

<http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1975/406> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Award> .

<http://data.nobelprize.org/resource/laureate/962> <http://xmlns.com/foaf/0.1/familyName> "Strickland" .

<http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1967/387> <http://purl.org/dc/terms/isPartOf> <http://data.nobelprize.org/resource/nobelprize/Physiology_or_Medicine/1967> .

<http://data.nobelprize.org/resource/city/College_Park%2C_MD> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/City> .

<http://data.nobelprize.org/resource/city/Berkeley%2C_CA> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q484678> .

<http://data.nobelprize.org/resource/laureate/200> <http://dbpedia.org/property/dateOfDeath> "1976-09-26"^^<http://www.w3.org/2001/XMLSchema#date> .

<http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1985/4

In [7]:
from rdflib import Graph

g = Graph()

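with gzip.open("data/01-nobelprize-data.nt.gz") as f:
    # Pass the bytes as `data` and name the format explicitly instead of relying on guessing.
    g.parse(data=f.read(), format="nt")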

In [8]:
for s, p, o in take(5, g):
    print(s, p, o)

http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1975/406 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/ontology/Award
http://data.nobelprize.org/resource/laureate/962 http://xmlns.com/foaf/0.1/familyName Strickland
http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1967/387 http://purl.org/dc/terms/isPartOf http://data.nobelprize.org/resource/nobelprize/Physiology_or_Medicine/1967
http://data.nobelprize.org/resource/city/College_Park%2C_MD http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/ontology/City
http://data.nobelprize.org/resource/city/Berkeley%2C_CA http://www.w3.org/2002/07/owl#sameAs http://www.wikidata.org/entity/Q484678


In [9]:
list(g.namespaces())

[('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace')),
 ('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#')),
 ('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#')),
 ('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#'))]

In [10]:
for s, p, o in take(5, g):
    print(s, p.n3(g.namespace_manager), o.n3(g.namespace_manager))

http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1975/406 rdf:type <http://dbpedia.org/ontology/Award>
http://data.nobelprize.org/resource/laureate/962 <http://xmlns.com/foaf/0.1/familyName> "Strickland"
http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1967/387 <http://purl.org/dc/terms/isPartOf> <http://data.nobelprize.org/resource/nobelprize/Physiology_or_Medicine/1967>
http://data.nobelprize.org/resource/city/College_Park%2C_MD rdf:type <http://dbpedia.org/ontology/City>
http://data.nobelprize.org/resource/city/Berkeley%2C_CA <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q484678>


In [11]:
def pprint_terms(terms, graph=g):
    print(*[t.n3(graph.namespace_manager) for t in terms])

In [12]:
from rdflib import Namespace
from rdflib.namespace import RDF
from toolz import take

NOBEL = Namespace("http://data.nobelprize.org/terms/")
g.namespace_manager.bind("nobel", NOBEL)

for s, p, o in take(5, g.triples((None, RDF.type, NOBEL.Laureate))):
    pprint_terms([s, p, o], g)

<http://data.nobelprize.org/resource/laureate/82> rdf:type nobel:Laureate
<http://data.nobelprize.org/resource/laureate/481> rdf:type nobel:Laureate
<http://data.nobelprize.org/resource/laureate/317> rdf:type nobel:Laureate
<http://data.nobelprize.org/resource/laureate/197> rdf:type nobel:Laureate
<http://data.nobelprize.org/resource/laureate/63> rdf:type nobel:Laureate
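
The pattern passed to `g.triples` treats `None` as a wildcard. Its semantics can be sketched in plain Python. This is only an illustration: `match` is a hypothetical function, the toy triples are made up, and a real store would typically use indexes rather than a linear scan.

```python
def match(triples, pattern):
    """Yield the triples that agree with pattern; None in the pattern matches anything."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple

# Made-up triples, written with prefixed names for brevity.
toy = [
    ("ex:laureate/1", "rdf:type", "nobel:Laureate"),
    ("ex:laureate/1", "foaf:familyName", "Doe"),
    ("ex:city/1", "rdf:type", "dbo:City"),
]

laureates = list(match(toy, (None, "rdf:type", "nobel:Laureate")))
about_laureate_1 = list(match(toy, ("ex:laureate/1", None, None)))
```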


In [13]:
for s, p, o in g.triples((NOBEL.Laureate, None, None)):
    pprint_terms([s, p, o], g)

In [14]:
print(NOBEL.Laureate)

http://data.nobelprize.org/terms/Laureate


In [15]:
g.parse(NOBEL.Laureate)

<Graph identifier=N6cd30e441e9a45fa820edbbec0d335b2 (<class 'rdflib.graph.Graph'>)>

In [16]:
for s, p, o in g.triples((NOBEL.Laureate, None, None)):
    pprint_terms([s, p, o], g)

nobel:Laureate rdf:type owl:Class
nobel:Laureate rdfs:label "Laureate"
nobel:Laureate rdfs:comment "A laureate is a person or organization who recieves one or several Nobel Prizes."
nobel:Laureate rdfs:subClassOf <http://xmlns.com/foaf/0.1/Agent>


In [17]:
from rdflib.namespace import FOAF

g.namespace_manager.bind("foaf", FOAF)

In [18]:
for s, p, o in g.triples((NOBEL.Laureate, None, None)):
    pprint_terms([s, p, o], g)

nobel:Laureate rdf:type owl:Class
nobel:Laureate rdfs:label "Laureate"
nobel:Laureate rdfs:comment "A laureate is a person or organization who recieves one or several Nobel Prizes."
nobel:Laureate rdfs:subClassOf foaf:Agent


In [19]:
from rdflib.namespace import OWL

for s, p, o in g.triples((OWL.Class, None, None)):
    pprint_terms([s, p, o], g)

In [20]:
print(OWL.Class)

http://www.w3.org/2002/07/owl#Class


In [21]:
g.parse(OWL.Class)

<Graph identifier=N6cd30e441e9a45fa820edbbec0d335b2 (<class 'rdflib.graph.Graph'>)>

In [22]:
from rdflib.namespace import OWL

for s, p, o in g.triples((OWL.Class, None, None)):
    pprint_terms([s, p, o], g)

owl:Class rdf:type rdfs:Class
owl:Class rdfs:comment "The class of OWL classes."
owl:Class rdfs:isDefinedBy <http://www.w3.org/2002/07/owl#>
owl:Class rdfs:label "Class"
owl:Class rdfs:subClassOf rdfs:Class


In [23]:
from rdflib.namespace import OWL

for s, p, o in g.triples((FOAF.Agent, None, None)):
    pprint_terms([s, p, o], g)

In [24]:
print(FOAF.Agent)

http://xmlns.com/foaf/0.1/Agent


In [25]:
g.parse(FOAF.Agent)

<Graph identifier=N6cd30e441e9a45fa820edbbec0d335b2 (<class 'rdflib.graph.Graph'>)>

In [26]:
from rdflib.namespace import OWL

for s, p, o in g.triples((FOAF.Agent, None, None)):
    pprint_terms([s, p, o], g)

foaf:Agent rdf:type owl:Class
foaf:Agent rdf:type rdfs:Class
foaf:Agent vs:term_status "stable"
foaf:Agent rdfs:label "Agent"
foaf:Agent rdfs:comment "An agent (eg. person, group, software or physical artifact)."
foaf:Agent owl:equivalentClass dcterms:Agent


In [27]:
list(g.namespaces())

[('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace')),
 ('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#')),
 ('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#')),
 ('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#')),
 ('nobel', rdflib.term.URIRef('http://data.nobelprize.org/terms/')),
 ('dc', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/')),
 ('dcterms', rdflib.term.URIRef('http://purl.org/dc/terms/')),
 ('owl', rdflib.term.URIRef('http://www.w3.org/2002/07/owl#')),
 ('role', rdflib.term.URIRef('http://purl.org/role/terms/')),
 ('foaf', rdflib.term.URIRef('http://xmlns.com/foaf/0.1/')),
 ('grddl', rdflib.term.URIRef('http://www.w3.org/2003/g/data-view#')),
 ('vs', rdflib.term.URIRef('http://www.w3.org/2003/06/sw-vocab-status/ns#')),
 ('wot', rdflib.term.URIRef('http://xmlns.com/wot/0.1/'))]

# 2. Representing terminology: RDFS and OWL

In [28]:
def term_in_ns(term, ns):
    # Use the `term` argument; the original read the global `subj` by accident.
    return str(term).startswith(str(ns))

In [29]:
from rdflib.namespace import OWL

for subj, pred, obj in g:
    if term_in_ns(subj, NOBEL):
        pprint_terms([subj, pred, obj], g)

nobel:laureate rdfs:label "laureate"
nobel:motivation rdfs:label "motivation"
nobel:motivation rdfs:domain nobel:LaureateAward
nobel:year rdfs:domain <http://dbpedia.org/ontology/Award>
nobel:sortOrder rdfs:label "sort order"
nobel:Economic_Sciences rdf:type owl:Thing
nobel:Litterature rdf:type owl:Thing
nobel:Physics rdfs:label "Physics"
nobel:Category rdfs:label "Nobel Prize category"
nobel:motivation rdfs:domain nobel:NobelPrize
nobel:laureateAward rdfs:subPropertyOf <http://dbpedia.org/ontology/Award>
nobel:share rdfs:label "share"
nobel:sortOrder rdf:type owl:DatatypeProperty
<http://data.nobelprize.org/terms/> rdf:type owl:Ontology
nobel:NobelPrize rdfs:seeAlso <http://dbpedia.org/resource/Nobel_Prize>
nobel:university rdfs:label "university"
nobel:Chemistry rdfs:label "Chemistry"
nobel:university rdfs:comment "Points to the universities the Laureate was affiliated with during the period he did his contribution that laid the ground for the award."
nobel:LaureateAward rdfs:subClas

In [30]:
from rdflib.namespace import OWL

datatype_properties = []

for subj, pred, obj in g.triples((None, RDF.type, OWL.DatatypeProperty)):
    if term_in_ns(subj, NOBEL):
        pprint_terms([subj, pred, obj], g)
        datatype_properties.append(subj)

nobel:year rdf:type owl:DatatypeProperty
nobel:motivation rdf:type owl:DatatypeProperty
nobel:share rdf:type owl:DatatypeProperty
nobel:sortOrder rdf:type owl:DatatypeProperty


In [31]:
for prop in datatype_properties:
    for pred, obj in g.predicate_objects(prop):
        pprint_terms([prop, pred, obj], g)
    print()

nobel:year rdf:type owl:DatatypeProperty
nobel:year rdfs:label "year"
nobel:year rdfs:comment "The year a given Nobel Prize was given."
nobel:year rdfs:domain <http://dbpedia.org/ontology/Award>
nobel:year rdfs:range xsd:Integer

nobel:motivation rdf:type owl:DatatypeProperty
nobel:motivation rdfs:label "motivation"
nobel:motivation rdfs:comment "The motivation for why the laureate or was given the Nobel Prize or the motivation for the whole prize."
nobel:motivation rdfs:domain nobel:NobelPrize
nobel:motivation rdfs:domain nobel:LaureateAward
nobel:motivation rdfs:range rdf:PlainLiteral

nobel:share rdf:type owl:DatatypeProperty
nobel:share rdfs:label "share"
nobel:share rdfs:comment "The share of a Nobel Prize given to a Laureate, may be 1, 2, 3 or 4 corresponding to the whole prize, half the prize, third of the priza and a quarter of the prize."
nobel:share rdfs:domain <http://dbpedia.org/ontology/LaureateAward>
nobel:share rdfs:range xsd:Integer

nobel:sortOrder rdf:type owl:Datatyp

In [32]:
from rdflib.namespace import OWL

object_properties = []

for subj, pred, obj in g.triples((None, RDF.type, OWL.ObjectProperty)):
    if term_in_ns(subj, NOBEL):
        pprint_terms([subj, pred, obj], g)
        object_properties.append(subj)

nobel:laureateAward rdf:type owl:ObjectProperty
nobel:nobelPrize rdf:type owl:ObjectProperty
nobel:category rdf:type owl:ObjectProperty
nobel:laureate rdf:type owl:ObjectProperty
nobel:university rdf:type owl:ObjectProperty


In [33]:
for prop in object_properties:
    for pred, obj in g.predicate_objects(prop):
        pprint_terms([prop, pred, obj], g)
    print()

nobel:laureateAward rdf:type owl:ObjectProperty
nobel:laureateAward rdfs:label "laureateAward"
nobel:laureateAward rdfs:comment "Connects each laureate with the part of the Nobel Prize, that is the LaureateAward, he or she recieved."
nobel:laureateAward rdfs:domain nobel:Laureate
nobel:laureateAward rdfs:range nobel:LaureateAward
nobel:laureateAward rdfs:subPropertyOf <http://dbpedia.org/ontology/Award>

nobel:nobelPrize rdf:type owl:ObjectProperty
nobel:nobelPrize rdfs:label "nobelPrize"
nobel:nobelPrize rdfs:comment "Points to the Nobel Prize recieved by a Laureate."
nobel:nobelPrize rdfs:domain nobel:Laureate
nobel:nobelPrize rdfs:range nobel:NobelPrize
nobel:nobelPrize rdfs:subPropertyOf <http://dbpedia.org/ontology/Award>

nobel:category rdf:type owl:ObjectProperty
nobel:category rdfs:label "category"
nobel:category rdfs:comment "The category this Nobel Prize belongs to."
nobel:category rdfs:domain nobel:NobelPrize
nobel:category rdfs:range nobel:Category

nobel:laureate rdf:type 

In [34]:
from rdflib.namespace import OWL

classes = []

for subj, pred, obj in g.triples((None, RDF.type, OWL.Class)):
    if term_in_ns(subj, NOBEL):
        pprint_terms([subj, pred, obj], g)
        classes.append(subj)

nobel:NobelPrize rdf:type owl:Class
nobel:LaureateAward rdf:type owl:Class
nobel:Laureate rdf:type owl:Class
nobel:Category rdf:type owl:Class


In [35]:
for cls in classes:
    for pred, obj in g.predicate_objects(cls):
        pprint_terms([cls, pred, obj], g)
    print()

nobel:NobelPrize rdf:type owl:Class
nobel:NobelPrize rdfs:label "Nobel Prize"
nobel:NobelPrize rdfs:seeAlso <http://dbpedia.org/resource/Nobel_Prize>
nobel:NobelPrize rdfs:comment "The Nobel Prize is a set of annual international awards bestowed in a number of categories by Scandinavian committees in recognition of cultural and scientific advances. The will of the Swedish chemist Alfred Nobel, the inventor of dynamite, established the prizes in 1895. The prizes in Physics, Chemistry, Physiology or Medicine, Literature, and Peace were first awarded in 1901. The Peace Prize is awarded in Oslo, Norway, while the other prizes are awarded in Stockholm, Sweden."
nobel:NobelPrize rdfs:subClassOf <http://dbpedia.org/ontology/Award>

nobel:LaureateAward rdf:type owl:Class
nobel:LaureateAward rdfs:label "Laureate Award"
nobel:LaureateAward rdfs:comment "The Nobel Prize is often divided to several laureates. LaureateAward captures the details of the award given to each laureate, such as share of 

In [36]:
categories = next(g.objects(NOBEL.Category, OWL.oneOf))

In [37]:
from rdflib.term import BNode

for p, o in g.predicate_objects(categories):
    pprint_terms([p,o], g)

rdf:first nobel:Chemistry
rdf:rest _:N9afe345f8441485faab1b2ce4aaabb5e
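
The output above shows only the first cell of an RDF collection: `rdf:first` holds one member and `rdf:rest` points to the next cell, until `rdf:nil` ends the list. A plain-Python sketch of walking such a list follows; `walk_rdf_list` is a hypothetical helper and the toy cells are made up. On a real graph, rdflib's `rdflib.collection.Collection` should serve the same purpose.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

def walk_rdf_list(triples, head):
    """Return the members of the RDF collection that starts at head, in order."""
    # Index (subject, predicate) -> object; each list cell has one first and one rest.
    index = {(s, p): o for s, p, o in triples}
    members, seen, node = [], set(), head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError("cycle in RDF list")
        seen.add(node)
        members.append(index[(node, RDF_FIRST)])
        node = index[(node, RDF_REST)]
    return members

# Made-up two-element list.
toy = [
    ("_:b0", RDF_FIRST, "nobel:Chemistry"),
    ("_:b0", RDF_REST, "_:b1"),
    ("_:b1", RDF_FIRST, "nobel:Physics"),
    ("_:b1", RDF_REST, RDF_NIL),
]
members = walk_rdf_list(toy, "_:b0")
```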


In [38]:
for p, o in g.predicate_objects(OWL.oneOf):
    pprint_terms([p,o], g)

rdf:type rdf:Property
rdfs:comment "The property that determines the collection of individuals or data values that build an enumeration."
rdfs:domain rdfs:Class
rdfs:isDefinedBy <http://www.w3.org/2002/07/owl#>
rdfs:label "oneOf"
rdfs:range rdf:List


In [39]:
g.parse(OWL.oneOf)

<Graph identifier=N6cd30e441e9a45fa820edbbec0d335b2 (<class 'rdflib.graph.Graph'>)>

In [40]:
for p, o in g.predicate_objects(OWL.oneOf):
    pprint_terms([p,o], g)

rdf:type rdf:Property
rdfs:comment "The property that determines the collection of individuals or data values that build an enumeration."
rdfs:domain rdfs:Class
rdfs:isDefinedBy <http://www.w3.org/2002/07/owl#>
rdfs:label "oneOf"
rdfs:range rdf:List


In [41]:
g.serialize("data/02-nobelprize-data-enriched.nt", format="nt")
!gzip -f data/02-nobelprize-data-enriched.nt
!du -h data/02-nobelprize-data-enriched.nt.gz

816K	data/02-nobelprize-data-enriched.nt.gz


# 3. Knowledge graph search: SPARQL

What information is there for laureates?

In [42]:
import gzip

from rdflib import Graph

g = Graph()

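with gzip.open("data/02-nobelprize-data-enriched.nt.gz") as f:
    # Pass the bytes as `data` and name the format explicitly instead of relying on guessing.
    g.parse(data=f.read(), format="nt")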

In [43]:
from rdflib.plugins.sparql import prepareQuery

q = prepareQuery("""
    SELECT ?s ?p ?o WHERE {
        ?s a nobel:Laureate .
        ?s ?p ?o .
    }
""", initNs={"nobel": NOBEL})

In [44]:
for row in take(100, g.query(q)):
    pprint_terms(row, g)

<http://data.nobelprize.org/resource/laureate/82> rdf:type <http://data.nobelprize.org/terms/Laureate>
<http://data.nobelprize.org/resource/laureate/82> rdf:type <http://xmlns.com/foaf/0.1/Person>
<http://data.nobelprize.org/resource/laureate/82> rdfs:label "Nicolay Gennadiyevich Basov"
<http://data.nobelprize.org/resource/laureate/82> <http://xmlns.com/foaf/0.1/givenName> "Nicolay G."
<http://data.nobelprize.org/resource/laureate/82> <http://dbpedia.org/ontology/birthPlace> <http://data.nobelprize.org/resource/city/Usman>
<http://data.nobelprize.org/resource/laureate/82> <http://dbpedia.org/ontology/birthPlace> <http://data.nobelprize.org/resource/country/Russia>
<http://data.nobelprize.org/resource/laureate/82> <http://dbpedia.org/ontology/birthPlace> <http://data.nobelprize.org/resource/country/USSR>
<http://data.nobelprize.org/resource/laureate/82> <http://data.nobelprize.org/terms/laureateAward> <http://data.nobelprize.org/resource/laureateaward/Physics/1964/82>
<http://data.nobel

In [45]:
DBO = Namespace("http://dbpedia.org/ontology/")

In [46]:
def prepQ(q: str):
    return prepareQuery(q, initNs={"nobel": NOBEL, "dbo": DBO})

In [47]:
g.namespace_manager.bind("laureate", Namespace("http://data.nobelprize.org/resource/laureate/"))
g.namespace_manager.bind("country", Namespace("http://data.nobelprize.org/resource/country/"))
g.namespace_manager.bind("city", Namespace("http://data.nobelprize.org/resource/city/"))
g.namespace_manager.bind("university", Namespace("http://data.nobelprize.org/resource/university/"))
g.namespace_manager.bind("dpb", Namespace("http://dbpedia.org/property/"))
g.namespace_manager.bind("dbo", DBO)
g.namespace_manager.bind("nobel", NOBEL)
g.namespace_manager.bind("foaf", FOAF)

What fraction of laureates are affiliated with an institution in a country that is not their country of birth?

In [48]:
q = prepQ("""
    SELECT (COUNT(?laureate) as ?nlaureates) WHERE {
        ?laureate a nobel:Laureate .
        
        ?laureate dbo:birthPlace ?bcountry .
        ?bcountry a dbo:Country .
    }
""")

for row in g.query(q):
    pprint_terms(row, g)

"1052"^^xsd:integer


In [49]:
q = prepQ("""
    SELECT (COUNT(?laureate) as ?nlaureates) WHERE {
        ?laureate a nobel:Laureate .
        
        ?laureate dbo:affiliation ?institution .
        ?institution dbo:country ?icountry .
        ?icountry a dbo:Country .
    }
""")

for row in g.query(q):
    pprint_terms(row, g)

"785"^^xsd:integer


In [50]:
q = prepQ("""
    SELECT (COUNT(?laureate) as ?nlaureates) WHERE {
        ?laureate a nobel:Laureate .

        ?laureate dbo:birthPlace ?bcountry .
        ?bcountry a dbo:Country .
        
        ?laureate dbo:affiliation ?institution .
        ?institution dbo:country ?icountry .
    }
""")

for row in g.query(q):
    pprint_terms(row, g)


"889"^^xsd:integer


In [51]:
q = prepQ("""
    SELECT (COUNT(?laureate) as ?nlaureates) WHERE {
        ?laureate a nobel:Laureate .

        ?laureate dbo:birthPlace ?bcountry .
        ?bcountry a dbo:Country .
        
        ?laureate dbo:affiliation ?institution .
        ?institution dbo:country ?icountry .
        
        FILTER(sameTerm(?bcountry,?icountry))
    }
""")

for row in g.query(q):
    pprint_terms(row, g)

"525"^^xsd:integer


In [52]:
q = prepQ("""
    SELECT (COUNT(?laureate) as ?nlaureates) WHERE {
        ?laureate a nobel:Laureate .

        ?laureate dbo:birthPlace ?bcountry .
        ?bcountry a dbo:Country .
        
        ?laureate dbo:affiliation ?institution .
        ?institution dbo:country ?icountry .
        
        FILTER(!sameTerm(?bcountry,?icountry))
    }
""")

for row in g.query(q):
    pprint_terms(row, g)

"364"^^xsd:integer


In [53]:
525 + 364 == 889

True

In [54]:
def as_pct(numer, denom):
    return f"{numer/denom:.1%}"

as_pct(364, 889)

'40.9%'
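
One caveat on this figure: `COUNT(?laureate)` counts query solutions, not distinct laureates. A laureate with several birth-country or affiliation triples (laureate 82 above has both `country/Russia` and `country/USSR` as `dbo:birthPlace`) can contribute more than one row. The sketch below shows the difference on made-up rows; whether it shifts the 40.9% noticeably is not checked here, and `COUNT(DISTINCT ?laureate)` would be the SPARQL counterpart of the second measure.

```python
# Each tuple is one query solution: (laureate, birth country, institution country).
# Values are made up for illustration.
rows = [
    ("laureate:1", "country:Russia", "country:USSR"),
    ("laureate:1", "country:USSR", "country:USSR"),
    ("laureate:2", "country:Germany", "country:USA"),
]

abroad_rows = [r for r in rows if r[1] != r[2]]
abroad_laureates = {r[0] for r in abroad_rows}
all_laureates = {r[0] for r in rows}

# Row-based fraction: what dividing two COUNT(?laureate) results measures.
row_fraction = len(abroad_rows) / len(rows)
# Laureate-based fraction: distinct laureates with at least one "abroad" row.
laureate_fraction = len(abroad_laureates) / len(all_laureates)
```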

In [55]:
print(f"""
    PREFIX nobel: <{NOBEL}>
    PREFIX dbo: <{DBO}>
    
    SELECT (COUNT(?laureate) as ?nlaureates) ?icountry ?bcountry WHERE {{
        ?laureate a nobel:Laureate .

        ?laureate dbo:birthPlace ?bcountry .
        ?bcountry a dbo:Country .
        
        ?laureate dbo:affiliation ?institution .
        ?institution dbo:country ?icountry .
        
        FILTER(!sameTerm(?bcountry,?icountry))
    }}
    GROUP BY ?icountry ?bcountry
    ORDER BY DESC(?nlaureates)
    LIMIT 5
""")


    PREFIX nobel: <http://data.nobelprize.org/terms/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    
    SELECT (COUNT(?laureate) as ?nlaureates) ?icountry ?bcountry WHERE {
        ?laureate a nobel:Laureate .

        ?laureate dbo:birthPlace ?bcountry .
        ?bcountry a dbo:Country .
        
        ?laureate dbo:affiliation ?institution .
        ?institution dbo:country ?icountry .
        
        FILTER(!sameTerm(?bcountry,?icountry))
    }
    GROUP BY ?icountry ?bcountry
    ORDER BY DESC(?nlaureates)
    LIMIT 5



In [56]:
q = prepareQuery(f"""
    PREFIX nobel: <{NOBEL}>
    PREFIX dbo: <{DBO}>
    
    SELECT (COUNT(?laureate) as ?nlaureates) ?icountry ?bcountry WHERE {{
        ?laureate a nobel:Laureate .

        ?laureate dbo:birthPlace ?bcountry .
        ?bcountry a dbo:Country .
        
        ?laureate dbo:affiliation ?institution .
        ?institution dbo:country ?icountry .
        
        FILTER(!sameTerm(?bcountry,?icountry))
    }}
    GROUP BY ?icountry ?bcountry
    ORDER BY DESC(?nlaureates)
    LIMIT 5
""")

for row in g.query(q):
    pprint_terms(row, g)

"21"^^xsd:integer country:USA country:Germany
"18"^^xsd:integer country:USA country:United_Kingdom
"14"^^xsd:integer country:USA country:Canada
"11"^^xsd:integer country:USSR country:Russia
"11"^^xsd:integer country:Germany country:Poland


# 4. Representing entities: collections and JSON documents

Make collections for:
- nobel:NobelPrize
- nobel:LaureateAward
- nobel:Laureate
- nobel:Category
- Institutions (objects of dbo:affiliation triples)
- Countries (objects of dbo:country triples)

In [57]:
from collections import defaultdict

class_collection = {
    "nobel:NobelPrize": "nobel_prizes",
    "nobel:LaureateAward": "laureate_awards",
    "nobel:Laureate": "laureates",
    "nobel:Category": "categories",
}

database = defaultdict(lambda: defaultdict(dict))

for cls, collection in class_collection.items():
    q = prepareQuery(f"""
        SELECT ?sub ?pred ?obj WHERE {{
            ?sub a {cls} .
            ?sub ?pred ?obj
        }}
    """, initNs={"nobel": NOBEL})

    for row in g.query(q):
        sub, pred, obj = row
        database[collection][sub][pred] = obj
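
Note that `database[collection][sub][pred] = obj` keeps a single value per predicate, so for multi-valued predicates (several `rdf:type` values, or labels in several languages) the last row seen wins. If every value should be kept, one option is to collect lists instead. A sketch on made-up rows, where `single` and `multi` are hypothetical names:

```python
from collections import defaultdict

# Toy (subject, predicate, object) rows; values are made up.
rows = [
    ("ex:prize/1", "rdfs:label", "The Nobel Prize in Literature 1921"),
    ("ex:prize/1", "rdfs:label", "Nobelpriset i litteratur 1921"),
    ("ex:prize/1", "nobel:year", 1921),
]

single = defaultdict(dict)                       # one value per predicate: last one wins
multi = defaultdict(lambda: defaultdict(list))   # keeps every value

for sub, pred, obj in rows:
    single[sub][pred] = obj
    multi[sub][pred].append(obj)
```

The list-valued shape is closer to what the JSON-LD serialization produces, at the cost of single-valued predicates also being wrapped in a list.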

In [58]:
q = prepareQuery("""
    SELECT ?institution ?pred ?obj WHERE {
        ?sub dbo:affiliation  ?institution .
        ?institution ?pred ?obj
    }
""", initNs={"nobel": NOBEL, "dbo": DBO})

for row in g.query(q):
    sub, pred, obj = row
    database["institutions"][sub][pred] = obj

In [59]:
individual_countries = set()

q = prepareQuery("""
    SELECT DISTINCT ?country WHERE {
        ?sub dbo:country  ?country .
    }
""", initNs={"nobel": NOBEL, "dbo": DBO})

for row in g.query(q):
    individual_countries.add(str(row[0]))

q = prepareQuery("""
    SELECT DISTINCT ?bcountry WHERE {
        ?laureate dbo:birthPlace ?bcountry .
        ?bcountry a dbo:Country .
    }
""", initNs={"nobel": NOBEL, "dbo": DBO})

for row in g.query(q):
    individual_countries.add(str(row[0]))

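for country in sorted(individual_countries):
    # The subject is fixed by the IRI in the pattern, so select only ?pred and ?obj;
    # a ?country variable would be unbound here and come back as None.
    q = prepareQuery(f"""
        SELECT ?pred ?obj WHERE {{
            <{country}> ?pred ?obj .
        }}
    """, initNs={"nobel": NOBEL, "dbo": DBO})

    for pred, obj in g.query(q):
        database["countries"][country][pred] = obj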

In [60]:
from json.encoder import (_make_iterencode, JSONEncoder,
                          encode_basestring_ascii, INFINITY,
                          encode_basestring)

class CustomObjectEncoder(JSONEncoder):

    def iterencode(self, o, _one_shot=False):
        """Encode the given object and yield each string
        representation as available.

        For example::

            for chunk in JSONEncoder().iterencode(bigobject):
                mysocket.write(chunk)
                
        Change from json.encoder.JSONEncoder.iterencode is setting
        _one_shot=False and isinstance=self.isinstance
        in call to `_make_iterencode`.
        And not using `c_make_encoder`.

        """
        if self.check_circular:
            markers = {}
        else:
            markers = None
        if self.ensure_ascii:
            _encoder = encode_basestring_ascii
        else:
            _encoder = encode_basestring

        def floatstr(o, allow_nan=self.allow_nan,
                _repr=float.__repr__, _inf=INFINITY, _neginf=-INFINITY):
            # Check for specials.  Note that this type of test is processor
            # and/or platform-specific, so do tests which don't depend on the
            # internals.

            if o != o:
                text = 'NaN'
            elif o == _inf:
                text = 'Infinity'
            elif o == _neginf:
                text = '-Infinity'
            else:
                return _repr(o)

            if not allow_nan:
                raise ValueError(
                    "Out of range float values are not JSON compliant: " +
                    repr(o))

            return text

        _iterencode = _make_iterencode(
                markers, self.default, _encoder, self.indent, floatstr,
                self.key_separator, self.item_separator, self.sort_keys,
                self.skipkeys, _one_shot=False, isinstance=self.isinstance)
        return _iterencode(o, 0)

In [61]:
import datetime

from rdflib.term import Literal, BNode

class RDFTermEncoder(CustomObjectEncoder):
    def isinstance(self, o, cls):
        if isinstance(o, (Literal, BNode)):
            return False
        return isinstance(o, cls)
    def default(self, o):
        if isinstance(o, Literal):
            rv = {"value": o.value}
            if o.datatype is not None:
                rv["datatype"] = o.datatype
            if o.language is not None:
                rv["lang"] = o.language
            return rv
        if isinstance(o, BNode):
            return "http://localhost/bnode/" + str(o)
        if isinstance(o, datetime.datetime):
            return o.isoformat()
        if isinstance(o, datetime.date):
            return str(o)
        # Let the base class default method raise the TypeError
        return super().default(o)
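
Why the `isinstance` override is needed: rdflib terms subclass `str`, so the stock encoder writes them as plain strings and never calls `default`, which drops datatype and language. The sketch below reproduces that behaviour with a stand-in class and shows an alternative that avoids the private `_make_iterencode`: convert the tree to plain Python before dumping. `Tagged`, `record_default`, and `to_plain` are hypothetical names, not part of rdflib or `json`.

```python
import json

class Tagged(str):
    """Hypothetical stand-in for an RDF literal: a str subclass with extra data."""
    def __new__(cls, value, datatype=None):
        obj = super().__new__(cls, value)
        obj.datatype = datatype
        return obj

calls = []

def record_default(o):
    """Record every object the encoder hands to `default`."""
    calls.append(o)
    return str(o)

doc = {"year": Tagged("1921", datatype="xsd:integer")}

# The stock encoder treats the str subclass as a string; `default` is never consulted.
plain = json.dumps(doc, default=record_default)

def to_plain(o):
    """Recursively replace Tagged values with dicts that keep the datatype."""
    if isinstance(o, Tagged):
        rv = {"value": str(o)}
        if o.datatype is not None:
            rv["datatype"] = o.datatype
        return rv
    if isinstance(o, dict):
        return {str(k): to_plain(v) for k, v in o.items()}
    if isinstance(o, (list, tuple)):
        return [to_plain(v) for v in o]
    return o

rich = json.dumps(to_plain(doc))
```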

In [62]:
db = json.loads(json.dumps(database, cls=RDFTermEncoder))

In [102]:
with open("data/03-document-database.json", "w") as f:
    json.dump(db, f, indent=2)

In [103]:
!gzip -f data/03-document-database.json
!du -h data/03-document-database.json.gz

260K	data/03-document-database.json.gz
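
The file written above is one nested JSON object, whereas the stated end state is JSON Lines. A sketch of flattening the same `{collection: {id: document}}` shape into one `.jsonl` file per collection follows. `write_jsonl` is a hypothetical helper, the toy data is made up, and using the individual's IRI as `_id` is an assumption.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(db, out_dir):
    """Write one <collection>.jsonl file per collection; each line is one document."""
    out_dir = Path(out_dir)
    paths = {}
    for collection_name, collection in db.items():
        path = out_dir / f"{collection_name}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for individual, document in collection.items():
                # Assumption: the individual's IRI serves as the document `_id`.
                f.write(json.dumps({"_id": individual, **document}) + "\n")
        paths[collection_name] = path
    return paths

# Made-up database with the same nesting as `db`.
toy_db = {
    "laureates": {
        "ex:laureate/1": {"rdfs:label": {"value": "Jane Doe", "lang": "en"}},
    }
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_jsonl(toy_db, tmp)
    lines = paths["laureates"].read_text(encoding="utf-8").splitlines()

first = json.loads(lines[0])
```

Keys that are full IRIs contain `.` characters, which can be awkward as MongoDB field names; shorter prefixed keys are one way around that.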


In [104]:
with gzip.open("data/03-document-database.json.gz") as f:
    db = json.loads(f.read())

In [66]:
from pprint import pprint

for collection_name, collection in db.items():
    for individual, document in take(5, collection.items()):
        print("collection:", collection_name)
        print("id:", individual)
        pprint(document)

collection: nobel_prizes
id: http://data.nobelprize.org/resource/nobelprize/Physiology_or_Medicine/1995
{'http://data.nobelprize.org/terms/category': 'http://data.nobelprize.org/terms/Physiology_or_Medicine',
 'http://data.nobelprize.org/terms/categoryOrder': {'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
                                                    'value': 3},
 'http://data.nobelprize.org/terms/laureate': 'http://data.nobelprize.org/resource/laureate/452',
 'http://data.nobelprize.org/terms/year': {'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
                                           'value': 1995},
 'http://purl.org/dc/terms/hasPart': 'http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1995/453',
 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 'http://dbpedia.org/ontology/Award',
 'http://www.w3.org/2000/01/rdf-schema#label': {'lang': 'en',
                                                'value': 'The Nobel Prize in '
          

# 5. Framing linked-data subgraphs as documents: JSON-LD

In [67]:
from pyld import jsonld

In [68]:
db_ld = json.loads(g.serialize(format='json-ld', indent=2))

In [69]:
len(db_ld)

6041

In [70]:
db_ld[0]

{'@id': 'http://data.nobelprize.org/resource/nobelprize/Literature/1921',
 '@type': ['http://dbpedia.org/ontology/Award',
  'http://data.nobelprize.org/terms/NobelPrize'],
 'http://data.nobelprize.org/terms/category': [{'@id': 'http://data.nobelprize.org/terms/Literature'}],
 'http://data.nobelprize.org/terms/categoryOrder': [{'@value': 4}],
 'http://data.nobelprize.org/terms/laureate': [{'@id': 'http://data.nobelprize.org/resource/laureate/590'}],
 'http://data.nobelprize.org/terms/year': [{'@value': 1921}],
 'http://purl.org/dc/terms/hasPart': [{'@id': 'http://data.nobelprize.org/resource/laureateaward/Literature/1921/590'}],
 'http://www.w3.org/2000/01/rdf-schema#label': [{'@language': 'no',
   '@value': 'Nobelprisen i litteratur 1921'},
  {'@language': 'sv', '@value': 'Nobelpriset i litteratur 1921'},
  {'@language': 'en', '@value': 'The Nobel Prize in Literature 1921'}]}

In [71]:
db_ld[-1]

{'@id': 'http://data.nobelprize.org/resource/laureateaward/Physiology_or_Medicine/1964/379',
 '@type': ['http://dbpedia.org/ontology/Award',
  'http://data.nobelprize.org/terms/LaureateAward'],
 'http://data.nobelprize.org/terms/category': [{'@id': 'http://data.nobelprize.org/resource/category/Physiology_or_Medicine'}],
 'http://data.nobelprize.org/terms/laureate': [{'@id': 'http://data.nobelprize.org/resource/laureate/379'}],
 'http://data.nobelprize.org/terms/motivation': [{'@language': 'en',
   '@value': 'for their discoveries concerning the mechanism and regulation of the cholesterol and fatty acid metabolism'},
  {'@language': 'sv',
   '@value': 'för deras upptäckter rörande kolesterol - och fettsyreomsättningens mekanism och reglering'}],
 'http://data.nobelprize.org/terms/share': [{'@value': 2}],
 'http://data.nobelprize.org/terms/sortOrder': [{'@value': 2}],
 'http://data.nobelprize.org/terms/university': [{'@id': 'http://data.nobelprize.org/resource/university/Max-Planck-Insti

In [72]:
for n in g.namespaces():
    print(n)

('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace'))
('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#'))
('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#'))
('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#'))
('laureate', rdflib.term.URIRef('http://data.nobelprize.org/resource/laureate/'))
('country', rdflib.term.URIRef('http://data.nobelprize.org/resource/country/'))
('city', rdflib.term.URIRef('http://data.nobelprize.org/resource/city/'))
('university', rdflib.term.URIRef('http://data.nobelprize.org/resource/university/'))
('dpb', rdflib.term.URIRef('http://dbpedia.org/property/'))
('dbo', rdflib.term.URIRef('http://dbpedia.org/ontology/'))
('nobel', rdflib.term.URIRef('http://data.nobelprize.org/terms/'))
('foaf', rdflib.term.URIRef('http://xmlns.com/foaf/0.1/'))


In [73]:
context = {
    prefix: str(uri) for prefix, uri in g.namespaces()
}

In [74]:
context

{'xml': 'http://www.w3.org/XML/1998/namespace',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#',
 'laureate': 'http://data.nobelprize.org/resource/laureate/',
 'country': 'http://data.nobelprize.org/resource/country/',
 'city': 'http://data.nobelprize.org/resource/city/',
 'university': 'http://data.nobelprize.org/resource/university/',
 'dpb': 'http://dbpedia.org/property/',
 'dbo': 'http://dbpedia.org/ontology/',
 'nobel': 'http://data.nobelprize.org/terms/',
 'foaf': 'http://xmlns.com/foaf/0.1/'}

In [75]:
context["category"] = "http://data.nobelprize.org/resource/category/"
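
The extra `category` entry matters because compaction can only shorten an IRI when the context holds a namespace that the IRI starts with. The sketch below illustrates only that prefix-shortening idea on a small slice of the context. It is not PyLD's compaction algorithm, and `shorten` and `toy_context` are hypothetical names introduced here.

```python
def shorten(iri, context):
    """Return 'prefix:suffix' for the longest matching namespace, else the IRI unchanged.

    Simplified illustration only; real JSON-LD compaction has more rules.
    """
    best = None
    for prefix, namespace in context.items():
        if iri.startswith(namespace) and len(iri) > len(namespace):
            if best is None or len(namespace) > len(best[1]):
                best = (prefix, namespace)
    if best is None:
        return iri
    prefix, namespace = best
    return f"{prefix}:{iri[len(namespace):]}"


# A small slice of the context (values copied from the notebook output).
toy_context = {
    "nobel": "http://data.nobelprize.org/terms/",
    "laureate": "http://data.nobelprize.org/resource/laureate/",
    "category": "http://data.nobelprize.org/resource/category/",
}

print(shorten("http://data.nobelprize.org/resource/laureate/590", toy_context))
print(shorten("http://data.nobelprize.org/resource/category/Peace", toy_context))
# No namespace for purl.org is defined, so this IRI is returned in full.
print(shorten("http://purl.org/dc/terms/hasPart", toy_context))
```

IRIs with no matching namespace, such as the `purl.org` one here, stay in their full form.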

In [76]:
compacted = jsonld.compact(db_ld, context)

In [77]:
compacted.keys()

dict_keys(['@context', '@graph'])

In [78]:
len(compacted["@graph"])

4149

In [79]:
for item in take(5, compacted["@graph"]):
    pprint(item)

{'@id': 'http://data.nobelprize.org/resource/nobelprize/Literature/1921',
 '@type': ['dbo:Award', 'nobel:NobelPrize'],
 'http://purl.org/dc/terms/hasPart': {'@id': 'http://data.nobelprize.org/resource/laureateaward/Literature/1921/590'},
 'nobel:category': {'@id': 'nobel:Literature'},
 'nobel:categoryOrder': 4,
 'nobel:laureate': {'@id': 'laureate:590'},
 'nobel:year': 1921,
 'rdfs:label': [{'@language': 'no', '@value': 'Nobelprisen i litteratur 1921'},
                {'@language': 'sv', '@value': 'Nobelpriset i litteratur 1921'},
                {'@language': 'en',
                 '@value': 'The Nobel Prize in Literature 1921'}]}
{'@id': 'http://data.nobelprize.org/resource/laureateaward/Peace/1971/529',
 '@type': ['nobel:LaureateAward', 'dbo:Award'],
 'http://purl.org/dc/terms/isPartOf': {'@id': 'http://data.nobelprize.org/resource/nobelprize/Peace/1971'},
 'nobel:category': {'@id': 'category:Peace'},
 'nobel:laureate': {'@id': 'laureate:529'},
 'nobel:motivation': [{'@language': '

In [80]:
frame = {
    "@context": context,
    "@type": "nobel:Laureate",
    "@requireAll": True,
    "@explicit": True,
    "foaf:name": {},
    "dbo:birthPlace": {
        "@requireAll": True,
        "@explicit": True,
        "@embed": "@always",
        "@type": "dbo:Country",
    },
    "dbo:affiliation": {
        "@requireAll": True,
        "@explicit": True,
        "@embed": "@always",
        "dbo:country": {},
    }
}

In [81]:
framed = jsonld.frame(compacted, frame)

In [82]:
len(framed["@graph"])

713

In [83]:
pprint(list(take(5, framed["@graph"])))

[{'@id': 'laureate:1',
  '@type': ['nobel:Laureate', 'foaf:Person'],
  'dbo:affiliation': {'@id': 'university:Munich_University',
                      '@type': 'dbo:University',
                      'dbo:country': {'@id': 'country:Germany',
                                      '@type': 'dbo:Country',
                                      'http://www.w3.org/2002/07/owl#sameAs': {'@id': 'http://www.wikidata.org/entity/Q183'},
                                      'rdfs:label': [{'@language': 'en',
                                                      '@value': 'Germany'},
                                                     {'@language': 'no',
                                                      '@value': 'Tyskland'},
                                                     {'@language': 'sv',
                                                      '@value': 'Tyskland'}]}},
  'dbo:birthPlace': [{'@id': 'country:Germany', '@type': 'dbo:Country'},
                     {'@id': 'country:Prussi

In [105]:
with open("data/04-jsonld-framed-laureates.json", "w") as f:
    json.dump(framed, f, indent=2)

In [106]:
!gzip -f data/04-jsonld-framed-laureates.json

# 6. Document collection search: MongoDB

In [86]:
from pymongo import MongoClient

client = MongoClient()

In [87]:
mdb = client["nobel"]

In [107]:
import gzip
import json

with gzip.open("data/04-jsonld-framed-laureates.json.gz") as f:
    framed = json.load(f)

In [89]:
from toolz import assoc

mdb.laureates.drop()
rv = mdb.laureates.insert_many([assoc(doc, "@context", context) for doc in framed["@graph"]])

In [90]:
len(rv.inserted_ids)

713

What fraction of laureates are affiliated with at least one institution in a country other than their country of birth?

In [91]:
def as_list(d):
    return d if isinstance(d, list) else [d]
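
The helper is needed because the compacted, framed documents are not uniform: a property with one value appears as a single object, while a property with several values appears as a list (compare `dbo:affiliation` and `dbo:birthPlace` in the framed output above). A minimal sketch on toy documents follows. `as_list` is repeated so the block runs on its own, and `birth_countries` and the toy documents are hypothetical.

```python
def as_list(d):
    return d if isinstance(d, list) else [d]


def birth_countries(doc):
    """Collect birth-country ids whether the property holds one object or a list."""
    return {p["@id"] for p in as_list(doc["dbo:birthPlace"])}


# Toy documents shaped like the framed output (illustrative, not real records).
single = {"dbo:birthPlace": {"@id": "country:Germany"}}
multiple = {"dbo:birthPlace": [{"@id": "country:Germany"}, {"@id": "country:Denmark"}]}

print(birth_countries(single))
print(sorted(birth_countries(multiple)))
```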

In [92]:
n_affiliated_with_nonbirthcountry_institution = 0

for d in mdb.laureates.find():
    # Countries of all affiliated institutions, and all recorded birth countries.
    countries_affil = {c["@id"] for a in as_list(d["dbo:affiliation"]) for c in as_list(a["dbo:country"])}
    countries_birth = {p["@id"] for p in as_list(d["dbo:birthPlace"])}

    # Count the laureate if any affiliation country is not a birth country.
    if countries_affil - countries_birth:
        n_affiliated_with_nonbirthcountry_institution += 1

In [93]:
as_pct(n_affiliated_with_nonbirthcountry_institution, mdb.laureates.count_documents({}))

'34.8%'

What fraction of laureates are affiliated exclusively with institutions that are not in their country of birth?

In [94]:
n_affiliated_exclusively_with_nonbirthcountry_institutions = 0

for d in mdb.laureates.find():
    # Countries of all affiliated institutions, and all recorded birth countries.
    countries_affil = {c["@id"] for a in as_list(d["dbo:affiliation"]) for c in as_list(a["dbo:country"])}
    countries_birth = {p["@id"] for p in as_list(d["dbo:birthPlace"])}

    # Count the laureate only if no affiliation country is a birth country.
    if countries_affil.isdisjoint(countries_birth):
        n_affiliated_exclusively_with_nonbirthcountry_institutions += 1

In [95]:
as_pct(n_affiliated_exclusively_with_nonbirthcountry_institutions, mdb.laureates.count_documents({}))

'30.9%'
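
The two questions differ only in the set test: a non-empty set difference means at least one affiliation country is not a birth country, while disjointness means none is. A toy comparison makes the distinction concrete. The three cases and both function names are hypothetical, not taken from the dataset.

```python
def has_nonbirth_affiliation(affil, birth):
    """True if at least one affiliation country is not a birth country."""
    return bool(affil - birth)


def only_nonbirth_affiliations(affil, birth):
    """True if no affiliation country is a birth country."""
    return affil.isdisjoint(birth)


# (affiliation countries, birth countries) for three illustrative cases.
toy_cases = {
    "stayed": ({"country:Denmark"}, {"country:Denmark"}),
    "both": ({"country:Denmark", "country:Germany"}, {"country:Denmark"}),
    "moved": ({"country:Germany"}, {"country:Denmark"}),
}

for name, (affil, birth) in toy_cases.items():
    print(name, has_nonbirth_affiliation(affil, birth), only_nonbirth_affiliations(affil, birth))
```

Only the "both" case separates the two counts, which is why the second percentage can never exceed the first when every laureate has an affiliation country.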

In [96]:
list(mdb.laureates.find({"dbo:birthPlace.@id": "country:Denmark"}, {"foaf:name": 1, "_id": 0}))

[{'foaf:name': 'Aage N. Bohr'},
 {'foaf:name': 'Niels Bohr'},
 {'foaf:name': 'Jens C. Skou'},
 {'foaf:name': 'August Krogh'},
 {'foaf:name': 'Johannes Fibiger'},
 {'foaf:name': 'Henrik Dam'}]

In [97]:
list(mdb.laureates.find({"dbo:affiliation.dbo:country.@id": "country:Denmark"}, {"foaf:name": 1, "_id": 0}))

[{'foaf:name': 'Aage N. Bohr'},
 {'foaf:name': 'Ben R. Mottelson'},
 {'foaf:name': 'Niels Bohr'},
 {'foaf:name': 'Jens C. Skou'},
 {'foaf:name': 'Niels Ryberg Finsen'},
 {'foaf:name': 'August Krogh'},
 {'foaf:name': 'Johannes Fibiger'},
 {'foaf:name': 'Henrik Dam'},
 {'foaf:name': 'Dale T. Mortensen'}]
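
Both queries use dotted paths, which reach into embedded documents and into lists of embedded documents alike, so no `as_list` normalization is needed on the query side. The sketch below approximates that matching rule in plain Python to show why a path such as `dbo:birthPlace.@id` works for either shape. It is a simplification of MongoDB's behavior (no numeric indexes, no operators), and `values_at`, `matches`, and the toy document are hypothetical.

```python
def values_at(doc, path):
    """Collect every value reachable at a dotted path, descending into lists."""
    nodes = [doc]
    for key in path.split("."):
        next_nodes = []
        for node in nodes:
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_nodes.append(candidate[key])
        nodes = next_nodes
    # A final list value contributes each of its elements.
    flat = []
    for node in nodes:
        flat.extend(node if isinstance(node, list) else [node])
    return flat


def matches(doc, path, value):
    """Approximate an equality match on a dotted path."""
    return value in values_at(doc, path)


# Toy document: birthPlace is a list, affiliation is a single embedded object.
toy_doc = {
    "foaf:name": "Example Laureate",
    "dbo:birthPlace": [{"@id": "country:Denmark"}, {"@id": "country:Germany"}],
    "dbo:affiliation": {"@id": "university:Example",
                        "dbo:country": {"@id": "country:Denmark"}},
}

print(matches(toy_doc, "dbo:birthPlace.@id", "country:Germany"))
print(matches(toy_doc, "dbo:affiliation.dbo:country.@id", "country:Denmark"))
print(matches(toy_doc, "dbo:affiliation.dbo:country.@id", "country:Germany"))
```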

In [98]:
len(mdb.laureates.distinct("dbo:birthPlace.@id"))

75

In [99]:
len(mdb.laureates.distinct("dbo:affiliation.dbo:country.@id"))

29

In [108]:
!mongoexport -d nobel -c laureates -o data/05-laureates-mongoexport.jsonl

2021-10-17T19:32:36.077-0400	connected to: mongodb://localhost/
2021-10-17T19:32:36.133-0400	exported 713 records


In [109]:
!gzip -f data/05-laureates-mongoexport.jsonl
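
The goal stated at the top mentions filtering JSON Lines by type in order to route documents to the appropriate Mongo collections, with Python as one option. A minimal sketch of that grouping step, assuming each line is one JSON-LD object whose `@type` is a string or a list of strings; `group_by_type` and the toy lines are hypothetical.

```python
import json
from collections import defaultdict


def group_by_type(lines):
    """Group JSON Lines documents under each of their @type values."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        types = doc.get("@type", [])
        if not isinstance(types, list):
            types = [types]
        for t in types:
            groups[t].append(doc)
    return dict(groups)


# Toy lines shaped like the exported documents (illustrative only).
toy_lines = [
    '{"@id": "laureate:1", "@type": ["nobel:Laureate", "foaf:Person"]}',
    '{"@id": "country:Germany", "@type": "dbo:Country"}',
]

groups = group_by_type(toy_lines)
print(sorted(groups))
```

To run this against the export above, the lines could come from `gzip.open("data/05-laureates-mongoexport.jsonl.gz", "rt")`, and each group could then be passed to `insert_many` on a collection of your choosing.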