This notebook demonstrates downloading the RDF content of the domain knowledge graph (DKG) and querying it with RDFLib in Python.

In [None]:
!python -m pip install rdflib pystow pandas tabulate

In [11]:
import rdflib
import pystow
from tabulate import tabulate
import pandas as pd

In [8]:
domain = "epi"
version = "2023-02-13"
url = f"https://askem-mira.s3.amazonaws.com/dkg/{domain}/build/{version}/dkg.ttl.gz"

In [9]:
graph = rdflib.Graph()

# Download the gzipped Turtle file once (pystow caches it locally), then parse it
with pystow.ensure_open_gz("mira", domain, version, url=url) as file:
    graph.parse(file, format="turtle")

Downloading dkg.ttl.gz: 0.00B [00:00, ?B/s]

Use SPARQL to list every distinct predicate in the graph.

In [13]:
results = graph.query("""\
    SELECT DISTINCT ?p
    WHERE {
        ?s ?p ?o
    }
""")

predicates_df = pd.DataFrame(results, columns=["predicate"])
predicates_df

Unnamed: 0,predicate
0,http://www.w3.org/2000/01/rdf-schema#label
1,http://purl.org/dc/terms/description
2,https://bioregistry.io/oboinowl:hasExactSynonym
3,http://www.w3.org/2000/01/rdf-schema#isDefinedBy
4,http://purl.org/dc/terms/hasVersion
...,...
369,https://bioregistry.io/vo:0000818
370,http://purl.obolibrary.org/obo/uberon/core#tru...
371,https://bioregistry.io/ndfrt:has_PK
372,https://bioregistry.io/ro:0002159


Use SPARQL to get parent-child relationships together with their labels, limited here to five rows. The predicate is `rdfs:subClassOf`, whose subject is the child and whose object is the parent.

In [14]:
results = graph.query("""\
    SELECT ?child ?childLabel ?parent ?parentLabel
    WHERE {
        ?child rdfs:subClassOf ?parent .
        ?child rdfs:label ?childLabel .
        ?parent rdfs:label ?parentLabel .
    }
    LIMIT 5
""")

parents_df = pd.DataFrame(results, columns=["child", "child_label", "parent", "parent_label"])
parents_df

Unnamed: 0,child,child_label,parent,parent_label
0,https://bioregistry.io/oae:0008001,dysstasia AE,https://bioregistry.io/oae:0002049,
1,https://bioregistry.io/oae:0004687,neurotoxicity AE,https://bioregistry.io/oae:0001215,
2,https://bioregistry.io/oae:0004923,cerebral hemorrhage AE,https://bioregistry.io/oae:0000801,
3,https://bioregistry.io/oae:0004931,cerebral venous sinus thrombosis AE,https://bioregistry.io/oae:0004534,
4,https://bioregistry.io/oae:0004935,cerebrovascular accident AE,https://bioregistry.io/oae:0000534,


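Use a SPARQL property path with the alternative operator (`|`) to match several predicates in one pattern: `rdfs:subClassOf` for child -> parent and `bfo:0000050` (part of) for part -> whole.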

In [20]:
results = graph.query("""\
    PREFIX bfo: <https://bioregistry.io/bfo:>

    SELECT ?child ?childLabel ?parent ?parentLabel
    WHERE {
        ?child rdfs:subClassOf|bfo:0000050 ?parent .
        ?child rdfs:label ?childLabel .
        ?parent rdfs:label ?parentLabel .
    }
    LIMIT 5
""")

parents_df = pd.DataFrame(results, columns=["child", "child_label", "parent", "parent_label"])
parents_df

Unnamed: 0,child,child_label,parent,parent_label
0,http://www.w3.org/2000/01/rdf-schema#Container,Container,http://www.w3.org/2000/01/rdf-schema#Resource,Resource
1,http://www.w3.org/2000/01/rdf-schema#Literal,Literal,http://www.w3.org/2000/01/rdf-schema#Resource,Resource
2,http://www.w3.org/2002/07/owl#AllDifferent,AllDifferent,http://www.w3.org/2000/01/rdf-schema#Resource,Resource
3,http://www.w3.org/2002/07/owl#AllDisjointClasses,AllDisjointClasses,http://www.w3.org/2000/01/rdf-schema#Resource,Resource
4,http://www.w3.org/2002/07/owl#AllDisjointPrope...,AllDisjointProperties,http://www.w3.org/2000/01/rdf-schema#Resource,Resource
