# OIH Dashboard pre-processor

This notebook demonstrates some approaches for processing the release graphs into a format that
is useful for the Dashboard UI.


In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  ## remove pandas future warning

import s3fs
import kglab
from minio import Minio
import rdflib
from rdflib import ConjunctiveGraph  #  needed for nquads

In [2]:
def publicurls(client, bucket, prefix):
    """Return a URL for every non-empty object under `prefix` in `bucket`."""
    urls = []
    objects = client.list_objects(bucket, prefix=prefix, recursive=True)
    for obj in objects:
        result = client.stat_object(bucket, obj.object_name)

        # Skip empty objects. TODO: find a way to check whether an object is public.
        if result.size > 0:
            url = client.presigned_get_object(bucket, obj.object_name)
            # print(f"Public URL for object: {url}")
            urls.append(url)

    return urls

In [3]:
# Check how many GPUs kglab can see, in case you want to confirm a GPU will be used
gc = kglab.get_gpu_count()
print(gc)

0


In [4]:
client = Minio("ossapi.oceaninfohub.org:80",  secure=False) # Create client with anonymous access.
urls = publicurls(client, "public", "graph")

In [5]:
print(urls)

['http://ossapi.oceaninfohub.org/public/graphs/summonedafricaioc_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedaquadocs_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedcioos_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonededmerp_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonededmo_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedemodnet_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinanodc_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemardocuments_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarexperts_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarinstitutions_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemartraining_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarvessels_v1_release.nq', 'http://ossapi.oceaninf

## Single Graph Load

At this point we have the URLs and could either load all of them in a loop or pick one out manually. This section demonstrates loading and working with a single graph.
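
Picking one URL out by hand can be replaced with a small lookup by provider name. The sketch below assumes the release files follow the `summoned<provider>_v1_release.nq` naming seen in the listing above; `pick_release_url` is a hypothetical helper, not part of any library.

```python
def pick_release_url(urls, provider):
    """Return the first URL whose file name matches the given provider.

    Assumes file names of the form summoned<provider>_v1_release.nq.
    """
    wanted = f"summoned{provider}_"
    for url in urls:
        filename = url.rsplit("/", 1)[-1]
        if filename.startswith(wanted):
            return url
    raise KeyError(f"no release graph found for provider {provider!r}")


# Two URLs taken from this notebook, used here as sample input.
sample_urls = [
    "http://ossapi.oceaninfohub.org/public/graphs/summonedafricaioc_v1_release.nq",
    "http://ossapi.oceaninfohub.org/public/graphs/summonedobis_v1_release.nq",
]
print(pick_release_url(sample_urls, "obis"))
```

Matching on the trailing underscore keeps similar provider names apart, for example `edmo` and `edmerp`.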


In [6]:
# load quad graph
g = ConjunctiveGraph()
g.parse("http://ossapi.oceaninfohub.org/public/graphs/summonedobis_v1_release.nq", format="nquads")
print(len(g))

161187


In [7]:
namespaces = {
    "shacl":   "http://www.w3.org/ns/shacl#" ,
    "schmea":   "https://schema.org/" ,
    "geo":      "http://www.opengis.net/ont/geosparql#",
}

kg = kglab.KnowledgeGraph(name = "OIH test", base_uri = "https://oceaninfohub.org/id/", namespaces = namespaces, use_gpus=True, import_graph = g)

In [16]:
sparql = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>

SELECT ?s ?type ?desc ?name ?url ?keywords
WHERE
{
   ?s rdf:type ?type .
   FILTER ( ?type IN (schema:ResearchProject, schema:Project, schema:Organization,
   schema:Dataset, schema:CreativeWork, schema:Person, schema:Map, schema:Course,
   schema:CourseInstance, schema:Event, schema:Vehicle) )
   ?s schema:description ?desc .
   ?s schema:name ?name .
   OPTIONAL { ?s schema:url ?url . }
   ?s schema:keywords ?keywords .
}
"""

pdf = kg.query_as_df(sparql)
# pdf.to_pandas() is needed in my kglab conda env, but it breaks with papermill
# for reasons unknown at this time, so it is not called here.


In [17]:
pdf.head(20)

|   | s | type | desc | name | url | keywords |
|---|---|---|---|---|---|---|
| 0 | <https://obis.org/dataset/1057a007-c31c-48a3-a... | schmea:Dataset | In Australia, it is thought that up to 26 Aust... | Census of annual pup production by Australian ... | https://obis.org/dataset/1057a007-c31c-48a3-a6... | Occurrence |
| 1 | <https://obis.org/dataset/1057a007-c31c-48a3-a... | schmea:Dataset | In Australia, it is thought that up to 26 Aust... | Census of annual pup production by Australian ... | https://obis.org/dataset/1057a007-c31c-48a3-a6... | Observation |
| 2 | <https://obis.org/dataset/d64477cf-491f-4de5-8... | schmea:Dataset | Original provider:\nObservatorio Ambiental Gra... | Canary Islands - OAG (aggregated per 1-degree ... | https://obis.org/dataset/d64477cf-491f-4de5-82... | Occurrence,Radio transmitters,Animal movements |
| 3 | <https://obis.org/dataset/d64477cf-491f-4de5-8... | schmea:Dataset | Original provider:\nObservatorio Ambiental Gra... | Canary Islands - OAG (aggregated per 1-degree ... | https://obis.org/dataset/d64477cf-491f-4de5-82... | Observation |
| 4 | <https://obis.org/dataset/d64477cf-491f-4de5-8... | schmea:Dataset | Original provider:\nObservatorio Ambiental Gra... | Canary Islands - OAG (aggregated per 1-degree ... | https://obis.org/dataset/d64477cf-491f-4de5-82... | Occurrence |
| 5 | <https://obis.org/dataset/e71d452f-615e-4654-b... | schmea:Dataset | Original provider:\nVirginia Aquarium and Mari... | Virginia and Maryland Sea Turtle Research and ... | https://obis.org/dataset/e71d452f-615e-4654-b7... | Occurrence,Marine Animal Survey,Marine Biology... |
| 6 | <https://obis.org/dataset/e71d452f-615e-4654-b... | schmea:Dataset | Original provider:\nVirginia Aquarium and Mari... | Virginia and Maryland Sea Turtle Research and ... | https://obis.org/dataset/e71d452f-615e-4654-b7... | Observation |
| 7 | <https://obis.org/dataset/e71d452f-615e-4654-b... | schmea:Dataset | Original provider:\nVirginia Aquarium and Mari... | Virginia and Maryland Sea Turtle Research and ... | https://obis.org/dataset/e71d452f-615e-4654-b7... | Occurrence |
| 8 | <https://obis.org/dataset/49f74e10-b23b-4aca-a... | schmea:Dataset | Tow video and epibenthic sled collections were... | Species assemblages, biomass and regional habi... | https://obis.org/dataset/49f74e10-b23b-4aca-a0... | Occurrence |
| 9 | <https://obis.org/dataset/49f74e10-b23b-4aca-a... | schmea:Dataset | Tow video and epibenthic sled collections were... | Species assemblages, biomass and regional habi... | https://obis.org/dataset/49f74e10-b23b-4aca-a0... | Observation |
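
The result repeats each subject once per `schema:keywords` value (rows 0 and 1 differ only in `keywords`). For a dashboard it may be more convenient to have one record per subject. A minimal sketch, assuming the rows are available as plain dictionaries (for example from `pdf.to_dict(orient="records")`) and that keyword terms contain no commas themselves; `merge_keywords` and the sample rows are illustrative only.

```python
def merge_keywords(rows):
    """Collapse rows that share a subject into one record with a keyword list.

    Fields other than `keywords` are taken from the first row seen per subject.
    """
    merged = {}
    for row in rows:
        record = merged.setdefault(row["s"], {**row, "keywords": []})
        # A keywords cell may hold one term or a comma-separated list of terms.
        for term in row["keywords"].split(","):
            term = term.strip()
            if term and term not in record["keywords"]:
                record["keywords"].append(term)
    return list(merged.values())


# Made-up rows shaped like the query result above.
rows = [
    {"s": "<https://example.org/dataset/1>", "name": "Example dataset", "keywords": "Occurrence"},
    {"s": "<https://example.org/dataset/1>", "name": "Example dataset", "keywords": "Observation"},
    {"s": "<https://example.org/dataset/2>", "name": "Other dataset", "keywords": "Occurrence,Animal movements"},
]
for record in merge_keywords(rows):
    print(record["s"], record["keywords"])
```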


In [52]:

sparql = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>

SELECT ?sc ?geo ?geotype ?geom
WHERE
{
   ?s rdf:type ?type .
   FILTER ( ?type IN (schema:ResearchProject, schema:Project, schema:Organization,
   schema:Dataset, schema:CreativeWork, schema:Person, schema:Map, schema:Course,
   schema:CourseInstance, schema:Event, schema:Vehicle) )
   ?s schema:spatialCoverage ?sc .
   ?sc a schema:Place .
   ?sc schema:geo ?geo .
   ?geo a ?geotype .
   ?geo schema:polygon ?geom .
}
"""

# Only schema:polygon is selected here; schema:latitude and schema:longitude are not queried.

pdf = kg.query_as_df(sparql)
# pdf.to_pandas() is needed in my kglab conda env, but it breaks with papermill
# for reasons unknown at this time, so it is not called here.


In [53]:
pdf.head(20)

### Save to JSON Lines format


In [21]:
# Convert data frame to JSON lines format
json_lines = pdf.to_json(orient='records', lines=True)

# Write JSON lines to a file
with open("pdf.jsonl", "w") as f:
    f.write(json_lines)
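
JSON Lines stores one JSON object per line, so the file can be streamed back record by record without loading it whole. A small sketch using only the standard library; `read_jsonl` is a hypothetical helper, and the demo writes a throwaway file with made-up records rather than touching `pdf.jsonl`.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo with a temporary file containing two illustrative records.
records = [{"s": "a", "name": "first"}, {"s": "b", "name": "second"}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for rec in records:
        tmp.write(json.dumps(rec) + "\n")

loaded = list(read_jsonl(tmp.name))
os.remove(tmp.name)
print(loaded)
```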

## All OIH-Graph load

The following loads the release graphs of all providers in the OIH-Graph into a single graph.

In [29]:
bg = ConjunctiveGraph()

for u in urls:
    print("Loading {}".format(u))
    bg.parse(u, format="nquads")

print(len(bg))

Loading http://ossapi.oceaninfohub.org/public/graphs/summonedafricaioc_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedaquadocs_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedcioos_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonededmerp_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonededmo_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedemodnet_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinanodc_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemardocuments_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarexperts_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarinstitutions_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemartraining_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/s
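
Parsing many remote files in one loop means a single unreachable or malformed file aborts the whole run. One option is to collect failures and carry on. The sketch below takes the parse step as a callable so it can run without network access; in this notebook that callable might be `lambda u: bg.parse(u, format="nquads")`. `load_all` and `fake_parse` are hypothetical helpers and the demo URLs are placeholders.

```python
def load_all(urls, parse_one):
    """Call parse_one(url) for every URL; return (loaded, failed) lists."""
    loaded, failed = [], []
    for url in urls:
        try:
            parse_one(url)
            loaded.append(url)
        except Exception as exc:  # broad on purpose: network and parser errors differ
            failed.append((url, str(exc)))
    return loaded, failed


def fake_parse(url):
    """Stand-in for the real parser: rejects anything that is not an .nq file."""
    if not url.endswith(".nq"):
        raise ValueError("not an N-Quads file")


loaded, failed = load_all(
    ["http://example.org/a.nq", "http://example.org/b.txt"], fake_parse
)
print(loaded)
print(failed)
```

Returning the failures instead of only printing them makes it possible to retry or report them after the loop finishes.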

In [30]:
namespaces = {
    "shacl":   "http://www.w3.org/ns/shacl#" ,
    "schmea":   "https://schema.org/" ,
    "geo":      "http://www.opengis.net/ont/geosparql#",
}

bkg = kglab.KnowledgeGraph(name = "OIH test", base_uri = "https://oceaninfohub.org/id/", use_gpus=True, namespaces = namespaces, import_graph = bg)

In [31]:
sparql = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>


SELECT ?p (COUNT(?p) as ?count)
WHERE
{
  ?s ?p ?o .
}
GROUP BY ?p ORDER BY DESC(?count)
"""

pdf = bkg.query_as_df(sparql)
# pdf.to_pandas() is needed in my kglab conda env, but it breaks with papermill
# for reasons unknown at this time, so it is not called here.


In [32]:
pdf.head()

|   | p | count |
|---|---|---|
| 0 | rdf:type | 634761 |
| 1 | schmea:name | 317029 |
| 2 | schmea:keywords | 306362 |
| 3 | schmea:url | 189343 |
| 4 | schmea:description | 135034 |


In [32]:
bkg.save_parquet("OIHGraph_25032023.parquet")
