# OIH Dashboard pre-processor

This notebook demonstrates some approaches for processing the release graphs into a format that
is useful for the Dashboard UI.


In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  ## remove pandas future warning

import s3fs
import pyarrow.parquet as pq
import os
import re
import kglab
from minio import Minio
import rdflib
from rdflib import ConjunctiveGraph  #  needed for nquads

In [3]:
def publicurls(client, bucket, prefix):
    urls = []
    objects = client.list_objects(bucket, prefix=prefix, recursive=True)
    for obj in objects:
        result = client.stat_object(bucket, obj.object_name)

        if result.size > 0:  # open question: how to tell whether an object is public (something like obj.is_public?)
            url = client.presigned_get_object(bucket, obj.object_name)
            # print(f"Public URL for object: {url}")
            urls.append(url)

    return urls
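
The URLs printed further down have the plain form `http://<endpoint>/<bucket>/<object name>`. As a minimal sketch, assuming the bucket allows anonymous reads, the same URL can be assembled without asking the client to presign anything. The helper name `plain_public_url` is hypothetical and not part of the MinIO client.

```python
from urllib.parse import quote


def plain_public_url(endpoint, bucket, object_name, secure=False):
    """Build an unsigned URL for an object in an anonymously readable bucket.

    Assumption: the server exposes objects at <scheme>://<endpoint>/<bucket>/<object>.
    """
    scheme = "https" if secure else "http"
    # quote() leaves "/" untouched, so nested prefixes such as "graphs/..." stay intact.
    return f"{scheme}://{endpoint}/{bucket}/{quote(object_name)}"


url = plain_public_url(
    "ossapi.oceaninfohub.org", "public", "graphs/summonedobis_v1_release.nq"
)
print(url)
```

This only builds a string; it does not check that the object exists or is readable.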

In [4]:
# Report the GPU count, in case you want to confirm that your GPU is being used
gc = kglab.get_gpu_count()
print(gc)

0


In [5]:
client = Minio("ossapi.oceaninfohub.org:80",  secure=False) # Create client with anonymous access.
urls = publicurls(client, "public", "graph")

In [6]:
print(urls)

['http://ossapi.oceaninfohub.org/public/graphs/summonedafricaioc_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedaquadocs_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedcioos_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonededmerp_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonededmo_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedemodnet_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinanodc_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemardocuments_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarexperts_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarinstitutions_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemartraining_v1_release.nq', 'http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarvessels_v1_release.nq', 'http://ossapi.oceaninf

## Single Graph Load

At this point we have the URLs. We could either loop over them and load them all, or pull one out manually and use it. This section demonstrates loading and working with a single graph.
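
Pulling one URL out of the list by hand can also be done by provider name. The sketch below assumes the `summoned<provider>_v1_release.nq` naming pattern seen in the listing; `pick_release_url` and `sample_urls` are hypothetical names used only for illustration.

```python
def pick_release_url(urls, provider):
    """Return the first URL whose file name is summoned<provider>_v1_release.nq."""
    wanted = f"summoned{provider}_v1_release.nq"
    for u in urls:
        # Compare only the last path segment, so the host and prefix do not matter.
        if u.rsplit("/", 1)[-1] == wanted:
            return u
    raise KeyError(f"no release graph found for provider {provider!r}")


# A short stand-in for the `urls` list built earlier in the notebook.
sample_urls = [
    "http://ossapi.oceaninfohub.org/public/graphs/summonedafricaioc_v1_release.nq",
    "http://ossapi.oceaninfohub.org/public/graphs/summonedcioos_v1_release.nq",
]
print(pick_release_url(sample_urls, "cioos"))
```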


In [6]:
# load quad graph
g = ConjunctiveGraph()
g.parse("http://ossapi.oceaninfohub.org/public/graphs/summonedobis_v1_release.nq", format="nquads")
print(len(g))

161187


In [7]:
namespaces = {
    "shacl":   "http://www.w3.org/ns/shacl#" ,
    "schmea":   "https://schema.org/" ,  # prefix label is misspelled; query results display it as "schmea:"
    "schemawrong": "http://schema.org/",
    "geo":      "http://www.opengis.net/ont/geosparql#",
}

kg = kglab.KnowledgeGraph(name = "OIH test", base_uri = "https://oceaninfohub.org/id/", namespaces = namespaces, use_gpus=True, import_graph = g)

In [8]:
sparql = """
PREFIX schema: <https://schema.org/>


SELECT ?s ?type ?desc ?name ?url ?keywords
WHERE
{
 ?s rdf:type ?type
   FILTER ( ?type IN (schema:ResearchProject, schema:Project, schema:Organization, 
   schema:Dataset, schema:CreativeWork, schema:Person, schema:Map, schema:Course,
   schema:CourseInstance, schema:Event, schema:Vehicle) )
   ?s schema:description ?desc .
   ?s schema:name ?name
       OPTIONAL { ?s schema:url ?url .   }
   ?s schema:keywords ?keywords

}
"""

pdf = kg.query_as_df(sparql)
# df = pdf   # .to_pandas()  #  breaks with papermill for reasons unknown at this time if to_pandas() is used, needed in my kglab conda env


In [9]:
pdf.to_parquet('output.parquet')


In [10]:
pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20101 entries, 0 to 20100
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   s         20101 non-null  object
 1   type      20101 non-null  object
 2   desc      20101 non-null  object
 3   name      20101 non-null  object
 4   url       20101 non-null  object
 5   keywords  20101 non-null  object
dtypes: object(6)
memory usage: 942.4+ KB


In [11]:
pdf.head(20)


|   | s | type | desc | name | url | keywords |
|---|---|------|------|------|-----|----------|
| 0 | <https://obis.org/dataset/1057a007-c31c-48a3-a... | schmea:Dataset | In Australia, it is thought that up to 26 Aust... | Census of annual pup production by Australian ... | https://obis.org/dataset/1057a007-c31c-48a3-a6... | Occurrence |
| 1 | <https://obis.org/dataset/1057a007-c31c-48a3-a... | schmea:Dataset | In Australia, it is thought that up to 26 Aust... | Census of annual pup production by Australian ... | https://obis.org/dataset/1057a007-c31c-48a3-a6... | Observation |
| 2 | <https://obis.org/dataset/d64477cf-491f-4de5-8... | schmea:Dataset | Original provider:\nObservatorio Ambiental Gra... | Canary Islands - OAG (aggregated per 1-degree ... | https://obis.org/dataset/d64477cf-491f-4de5-82... | Occurrence,Radio transmitters,Animal movements |
| 3 | <https://obis.org/dataset/d64477cf-491f-4de5-8... | schmea:Dataset | Original provider:\nObservatorio Ambiental Gra... | Canary Islands - OAG (aggregated per 1-degree ... | https://obis.org/dataset/d64477cf-491f-4de5-82... | Observation |
| 4 | <https://obis.org/dataset/d64477cf-491f-4de5-8... | schmea:Dataset | Original provider:\nObservatorio Ambiental Gra... | Canary Islands - OAG (aggregated per 1-degree ... | https://obis.org/dataset/d64477cf-491f-4de5-82... | Occurrence |
| 5 | <https://obis.org/dataset/e71d452f-615e-4654-b... | schmea:Dataset | Original provider:\nVirginia Aquarium and Mari... | Virginia and Maryland Sea Turtle Research and ... | https://obis.org/dataset/e71d452f-615e-4654-b7... | Occurrence,Marine Animal Survey,Marine Biology... |
| 6 | <https://obis.org/dataset/e71d452f-615e-4654-b... | schmea:Dataset | Original provider:\nVirginia Aquarium and Mari... | Virginia and Maryland Sea Turtle Research and ... | https://obis.org/dataset/e71d452f-615e-4654-b7... | Observation |
| 7 | <https://obis.org/dataset/e71d452f-615e-4654-b... | schmea:Dataset | Original provider:\nVirginia Aquarium and Mari... | Virginia and Maryland Sea Turtle Research and ... | https://obis.org/dataset/e71d452f-615e-4654-b7... | Occurrence |
| 8 | <https://obis.org/dataset/49f74e10-b23b-4aca-a... | schmea:Dataset | Tow video and epibenthic sled collections were... | Species assemblages, biomass and regional habi... | https://obis.org/dataset/49f74e10-b23b-4aca-a0... | Occurrence |
| 9 | <https://obis.org/dataset/49f74e10-b23b-4aca-a... | schmea:Dataset | Tow video and epibenthic sled collections were... | Species assemblages, biomass and regional habi... | https://obis.org/dataset/49f74e10-b23b-4aca-a0... | Observation |
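
The result above repeats each resource once per `schema:keywords` value. If the Dashboard prefers one row per resource (an assumption, not something this notebook states), the rows can be collapsed with a pandas `groupby`. The sketch below uses a small hand-made frame with the same column names; `collapse_keywords` is a hypothetical helper.

```python
import pandas as pd


def collapse_keywords(df):
    """Return one row per (s, type, desc, name, url) with keywords joined by '; '."""
    key_cols = ["s", "type", "desc", "name", "url"]
    return (
        df.groupby(key_cols, sort=False, dropna=False)["keywords"]
        # dict.fromkeys drops repeated keyword values while keeping their order.
        .agg(lambda values: "; ".join(dict.fromkeys(str(v) for v in values)))
        .reset_index()
    )


# Hand-made sample shaped like the query result (values are illustrative only).
sample = pd.DataFrame(
    {
        "s": ["<a>", "<a>", "<b>"],
        "type": ["schema:Dataset"] * 3,
        "desc": ["d1", "d1", "d2"],
        "name": ["n1", "n1", "n2"],
        "url": ["u1", "u1", "u2"],
        "keywords": ["Occurrence", "Observation", "Occurrence"],
    }
)
collapsed = collapse_keywords(sample)
print(collapsed)
```

The separator is `"; "` rather than a comma because some keyword values in the graph already contain commas.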


In [12]:

sparql = """
PREFIX schema: <https://schema.org/>


SELECT ?sc ?geo ?geotype ?geom

WHERE
{
   ?s rdf:type ?type
   FILTER ( ?type IN (schema:ResearchProject, schema:Project, schema:Organization, 
   schema:Dataset, schema:CreativeWork, schema:Person, schema:Map, schema:Course,
   schema:CourseInstance, schema:Event, schema:Vehicle) )
   ?s schema:spatialCoverage ?sc .
   ?sc a  schema:Place .
   ?sc schema:geo ?geo .
   ?geo a ?geotype .
   ?geo schema:polygon ?geom


}
"""

# schema:latitude   schema:longitude



pdf = kg.query_as_df(sparql)
# df = pdf   # .to_pandas()  #  breaks with papermill for reasons unknown at this time if to_pandas() is used, needed in my kglab conda env


In [13]:
pdf.head(20)

|   | sc | geo | geotype | geom |
|---|----|-----|---------|------|
| 0 | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | schmea:GeoShape | 135.96667 -43.63333,135.96667 -35.01667,150.23... |
| 1 | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | schmea:GeoShape | -74.5 5.5,-74.5 45.5,32.5 45.5,32.5 5.5,-74.5 5.5 |
| 2 | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | schmea:GeoShape | -76.39647 36.58278,-76.39647 38.52142,-74.3984... |
| 3 | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | schmea:GeoShape | 124.05919 -15.94544,124.05919 -15.22044,124.69... |
| 4 | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | <https://gleaner.io/xid/genid/cgf5dprk59mc73ej... | schmea:GeoShape | -149.5667 -25.15,-149.5667 79.7833,18.5667 79.... |
| 5 | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | schmea:GeoShape | -80.690317 29.954577,-80.690317 30.574196,-79.... |
| 6 | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | schmea:GeoShape | -73.5 12.5,-73.5 56,55 56,55 12.5,-73.5 12.5 |
| 7 | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | schmea:GeoShape | 18.291666 -34.501803,18.291666 -34.058889,19.3... |
| 8 | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | schmea:GeoShape | -74.32758 10.97982,-74.32758 11.26469,-74.1913... |
| 9 | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | <https://gleaner.io/xid/genid/cgf5dq3k59mc73ej... | schmea:GeoShape | 67.083 -56.733,67.083 -9.367,168.41701 -9.367,... |
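
The `geom` strings above are comma-separated pairs of space-separated numbers. A mapping layer will often want WKT instead, so here is a minimal sketch of that conversion. It assumes the first number of each pair is the x (longitude) value, which is inferred from sample values such as `135.96667` and is not a documented guarantee. `polygon_to_wkt` is a hypothetical helper.

```python
def polygon_to_wkt(geom):
    """Convert 'x y,x y,...' into a WKT POLYGON string.

    Assumption: each pair is 'x y' and the ring is already closed.
    """
    coords = [pair.split() for pair in geom.split(",")]
    if any(len(c) != 2 for c in coords):
        raise ValueError("every point must have exactly two values")
    # float() validates the numbers; the original text is kept to avoid losing digits.
    numeric = [(float(x), float(y)) for x, y in coords]
    if len(numeric) < 4 or numeric[0] != numeric[-1]:
        raise ValueError("expected a closed ring with at least four points")
    ring = ", ".join(f"{x} {y}" for x, y in coords)
    return f"POLYGON (({ring}))"


print(polygon_to_wkt("-73.5 12.5,-73.5 56,55 56,55 12.5,-73.5 12.5"))
```

Truncated values such as those ending in `...` in the display above would fail validation; the function is meant for the full strings in the DataFrame.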


### Save to JSON Lines format


In [14]:
# Convert data frame to JSON lines format
json_lines = pdf.to_json(orient='records', lines=True)

# Write JSON lines to a file
with open("pdf.jsonl", "w") as f:
    f.write(json_lines)
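
JSON Lines stores one JSON object per line, so a consumer can parse the file line by line. The sketch below shows that reading side with the standard library only. `read_json_lines` is a hypothetical helper, and the sample text is made up to match the column names of the query.

```python
import json


def read_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Illustrative stand-in for the contents of pdf.jsonl.
sample = (
    '{"sc":"<a>","geotype":"schema:GeoShape"}\n'
    '{"sc":"<b>","geotype":"schema:GeoShape"}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["sc"])
```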

## Loop over the files

In this section we loop over the files, load and query each one, and save the results to Parquet.


In [17]:
sparql = """
PREFIX schema: <https://schema.org/>
PREFIX schemawrong: <http://schema.org/>


SELECT ?g ?s ?type ?desc ?name ?url ?keywords
WHERE
{
     ?s rdf:type ?type
    FILTER ( ?type IN (schema:ResearchProject, schema:Project, schema:Organization,
    schema:Dataset, schema:CreativeWork, schema:Person, schema:Map, schema:Course,
    schema:CourseInstance, schema:Event, schema:Vehicle,   schemawrong:ResearchProject, schemawrong:Project, schemawrong:Organization,
    schemawrong:Dataset, schemawrong:CreativeWork, schemawrong:Person, schemawrong:Map, schemawrong:Course,
    schemawrong:CourseInstance, schemawrong:Event, schemawrong:Vehicle  ) )
    OPTIONAL {?s schema:description | schemawrong:description  ?desc .}
    OPTIONAL {?s schema:name | schemawrong:name ?name }
    OPTIONAL { ?s schema:url | schemawrong:url ?url .   }
    OPTIONAL {?s schema:keywords | schemawrong:keywords ?keywords}
 }
"""

In [23]:
for u in urls:
    print("Loading {}".format(u))
    g = ConjunctiveGraph()
    g.parse(u, format="nquads")
    print(len(g))
    namespaces = {
        "shacl":   "http://www.w3.org/ns/shacl#" ,
        "schmea":   "https://schema.org/" ,
        "schemawrong": "http://schema.org/",
        "geo":      "http://www.opengis.net/ont/geosparql#",
    }

    # get the name of the provider from the release graph file name
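    result = re.search('(?:summoned)(.*)(?:_v1_)', u)
    if result is None:
        # file name does not follow the summoned<provider>_v1_ pattern
        print("Skipping {}: no provider name found".format(u))
        continue
    extracted_text = result.group(1)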

    # import the file
    kg = kglab.KnowledgeGraph(name = "OIH test", base_uri = "https://oceaninfohub.org/id/", namespaces = namespaces, use_gpus=True, import_graph = g)

    # run the query
    pdf = kg.query_as_df(sparql)

    # add a column with the provider name in it
    pdf['provider'] = extracted_text

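    # save to parquet (create the output directory if it does not exist yet)
    os.makedirs('./graphparquet', exist_ok=True)
    pdf.to_parquet('./graphparquet/{}.parquet'.format(extracted_text))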



Loading http://ossapi.oceaninfohub.org/public/graphs/summonedafricaioc_v1_release.nq
3193
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedaquadocs_v1_release.nq
1102997
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedcioos_v1_release.nq
145779
Loading http://ossapi.oceaninfohub.org/public/graphs/summonededmerp_v1_release.nq
73543
Loading http://ossapi.oceaninfohub.org/public/graphs/summonededmo_v1_release.nq
101850
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedemodnet_v1_release.nq
1245
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinanodc_v1_release.nq
499
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemardocuments_v1_release.nq
133485
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarexperts_v1_release.nq
14638
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarinstitutions_v1_release.nq
3221
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemartraining_v1_relea
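
The loop takes the provider name from the file name with a regular expression. As a sketch of that step in isolation, the helper below returns `None` when a URL does not match, so unexpected objects in the bucket can be skipped. `provider_from_url` and the `README.txt` URL are hypothetical, and the pattern is slightly stricter than the one in the loop because it assumes the `_v1_release.nq` suffix.

```python
import re

# Assumed naming pattern: summoned<provider>_v1_release.nq at the end of the URL.
RELEASE_NAME = re.compile(r"summoned(.+?)_v1_release\.nq$")


def provider_from_url(url):
    """Return the provider name embedded in a release graph URL, or None."""
    match = RELEASE_NAME.search(url)
    return match.group(1) if match else None


print(provider_from_url(
    "http://ossapi.oceaninfohub.org/public/graphs/summonedafricaioc_v1_release.nq"
))
# Hypothetical non-matching object name, to show the None case.
print(provider_from_url("http://ossapi.oceaninfohub.org/public/graphs/README.txt"))
```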

In [27]:
import pandas as pd

directory = './graphparquet'

# This will list all files in the directory that end with .parquet.
files = [f for f in os.listdir(directory) if f.endswith('.parquet')]

# Collect each file's DataFrame, then concatenate once at the end.
frames = []

for file in files:
    # Complete file path.
    file_path = os.path.join(directory, file)

    # Read the parquet file and keep it for concatenation.
    frames.append(pd.read_parquet(file_path))

# DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
combined_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# If you want to write this combined DataFrame to a new Parquet file:
combined_df.to_parquet('combined.parquet')

## All OIH-Graph load

The following will load all the graphs of the providers in the OIH-Graph.

In [7]:
bg = ConjunctiveGraph()

for u in urls:
    print("Loading {}".format(u))
    bg.parse(u, format="nquads")

print(len(bg))

Loading http://ossapi.oceaninfohub.org/public/graphs/summonedafricaioc_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedaquadocs_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedcioos_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonededmerp_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonededmo_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedemodnet_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinanodc_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemardocuments_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarexperts_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemarinstitutions_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/summonedinvemartraining_v1_release.nq
Loading http://ossapi.oceaninfohub.org/public/graphs/s
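
Loading every provider in one loop means a single unreachable or malformed file stops the whole run. One possible way to keep going is to collect failures and report them at the end. The sketch below is not the notebook's method: `load_all` is a hypothetical helper, and `fake_parse` stands in for `lambda u: bg.parse(u, format="nquads")` so the example runs offline.

```python
def load_all(urls, parse_one):
    """Call parse_one(url) for every URL; collect failures instead of stopping."""
    loaded, failed = [], []
    for u in urls:
        try:
            parse_one(u)
            loaded.append(u)
        except Exception as err:  # broad on purpose: network and parser errors both land here
            failed.append((u, repr(err)))
    return loaded, failed


# Stand-in parser that fails for one made-up file name.
def fake_parse(url):
    if url.endswith("broken.nq"):
        raise ValueError("simulated parse failure")


loaded, failed = load_all(["a.nq", "broken.nq", "c.nq"], fake_parse)
print(loaded)
print(failed)
```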

In [8]:
namespaces = {
    "shacl":   "http://www.w3.org/ns/shacl#" ,
    "schmea":   "https://schema.org/" ,
    "schemawrong": "http://schema.org/",
    "geo":      "http://www.opengis.net/ont/geosparql#",
}

bkg = kglab.KnowledgeGraph(name = "OIH test", base_uri = "https://oceaninfohub.org/id/", use_gpus=True, namespaces = namespaces, import_graph = bg)

In [9]:
bkg.save_parquet("OIHGraph_rdf.parquet")
