# Welcome!
This notebook is intended as a template for an end-to-end (or, perhaps more aptly, a discovery-analysis-publish) pipeline for using Open NASA data. This version contains considerably more documentation and tutorial material, as we explain _what_ is supposed to happen and _why_ we have done these things.

![bookend-for-pids-context](./figures/bookend4pids-context.png)

Essentially, we want to ensure maximal interoperability and discoverability. In particular, we wish to expand the various representations of SPASE records into other PID formats, namely ROR (Research Organization Registry), RAID (Research Activity ID), and PIDINST (Persistent Identification of Instruments).

## Bookend for PIDs
Bookend is a template Jupyter notebook whose purpose is to demonstrate the usefulness of semantic technologies for accelerating research and facilitating [FAIR](https://www.go-fair.org/fair-principles/) and open science. This is generally accomplished by "bookending" novel code with cells that capture context and publish output with appropriate metadata.

![interchange](./figures/spase-raid-pidinst.png)

This `Bookend` is for emitting metadata in specific formats: SPASE, RAID, and PIDINST. The interchange format, to keep these `Bookends` thematically unified, is RDF, and thus interoperable with the larger knowledge graph ecosystem.

![spase-to-kg](./figures/spase-to-kg.png)

The PIDINST schema is depicted below. `hasValue` and `hasProperty` are included for explainability. The root node of a PIDINST document is the `Instrument`, which has various properties. Each of these properties (in the left box) has a value (depicted through `hasValue`). There are several sub-root nodes for collections (e.g., `InstrumentTypes` and `InstrumentType`). These relations are left unlabeled. These sub-root nodes may also have values.

![pidinst-schema](./figures/pidinst-schema.png)
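
The structure described above can be sketched with the standard library's `xml.etree.ElementTree`. The element names use the lowercase spelling that this notebook uses when it emits the record (`instrument`, `name`, `instrumentTypes`, `instrumentType`). The helper `add_collection` and the sample values are hypothetical, and the sketch is a simplification that has not been validated against the official PIDINST XSD.

```python
import xml.etree.ElementTree as ET

def add_collection(parent, collection_name, member_name, values):
    """Attach a collection node with one member node per value (hypothetical helper)."""
    collection = ET.SubElement(parent, collection_name)
    for value in values:
        member = ET.SubElement(collection, member_name)
        member.text = value
    return collection

# Root node of the document
instrument = ET.Element("instrument")

# A simple property: the node's text plays the role of `hasValue`
name = ET.SubElement(instrument, "name")
name.text = "Example magnetometer"  # placeholder value

# A collection: `instrumentTypes` is a sub-root node holding `instrumentType` members
add_collection(instrument, "instrumentTypes", "instrumentType", ["magnetometer"])

xml_text = ET.tostring(instrument, encoding="unicode")
print(xml_text)
```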

## Bookend for PIDs Requirements
* [rdflib](https://rdflib.readthedocs.io/en/stable/)
* [sparqlwrapper](https://sparqlwrapper.readthedocs.io/en/latest/)

## The BookBEGINNING

In [1]:
# rdflib is the general-purpose Python library for modifying a KG in memory and writing it to a file
import rdflib
## Just some convenient classes to pull out
from rdflib import URIRef, Graph, Namespace, Literal
## Namespaces are below. These are where identifiers "live", so to speak.
from rdflib import OWL, RDF, RDFS, XSD, TIME

# sparqlwrapper is used to query a triplestore.
# Import the class and the JSON constant directly; a module-level `import SPARQLWrapper`
# would just be shadowed by the class of the same name.
from SPARQLWrapper import SPARQLWrapper, JSON

## Prefixes

In [13]:
# Some generally useful default prefixes for namespaces.
pfs = {
    "geo": Namespace("http://www.opengis.net/ont/geosparql#"),
    "geof": Namespace("http://www.opengis.net/def/function/geosparql/"),
    "sf": Namespace("http://www.opengis.net/ont/sf#"),
    "wd": Namespace("http://www.wikidata.org/entity/"),
    "wdt": Namespace("http://www.wikidata.org/prop/direct/"),
    "dbo": Namespace("http://dbpedia.org/ontology/"),
    "ssn": Namespace("http://www.w3.org/ns/ssn/"),
    "sosa": Namespace("http://www.w3.org/ns/sosa/"),
    "cdt": Namespace("http://w3id.org/lindt/custom_datatypes#"),
    "ex": Namespace("https://example.com/"),
    "rdf": RDF,
    "rdfs": RDFS,
    "xsd": XSD,
    "owl": OWL,
    "time": TIME,  # rdflib's built-in namespace for http://www.w3.org/2006/time#
}

# The namespace and prefixes which we will use for the metadata storage
name_space = "https://polyneme.xyz/"
pfs["polyr"] = Namespace(f"{name_space}lod/resource#")
pfs["poly-ont"] =  Namespace(f"{name_space}lod/ontology#")
# Namespaces used by the converted SPASE records: the SPASE schema and Dublin Core elements
ns1 = Namespace("http://www.spase-group.org/data/schema/")
ns2 = Namespace("http://purl.org/dc/elements/1.1/")

## KG Data Structure
The knowledge graph (KG) that stores the metadata is, for now, kept in memory as a `Graph` from `rdflib`.

In [3]:
def init_kg(prefixes=pfs):
    """Create an empty graph with the given prefixes bound."""
    kg = Graph()
    for prefix, namespace in prefixes.items():
        kg.bind(prefix, namespace)
    return kg
# rdf:type shortcut
a = pfs["rdf"]["type"]

# Initialize an empty graph
graph = init_kg()

## Convert SPASE Record to Interchange Format (RDF)
This code is heavily based on parallel work within the Polyneme TOPST effort for converting SPASE records taken from CDAWeb to RDF. It can be found in the appropriate online repository: [topst-spase-rdf-tools](https://github.com/polyneme/topst-spase-rdf-tools).

In [11]:
from utils.spase_to_rdf import create_python_model_from_xsd
from utils.spase_to_rdf import create_owl_from_python_module
from utils.spase_to_rdf import xml_to_rdf
import xml.etree.ElementTree as ET

# Using the code from `topst-spase-rdf-tools` create an RDF graph of the SPASE record
# This currently loads exactly one sample record: PT0.512S
create_python_model_from_xsd("./data/spase-2.6.0.xsd", "spase_model")
create_owl_from_python_module("spase_model", "./data/spase.owl")
xml_to_rdf("./data/spase-data/", "spase_model", "./data/", partition_number=1)

# Load the graph into memory
graph.parse("./data/spase.ttl")


Parsing schema file:///home/cogan/repos/topst/bookend/data/spase-2.6.0.xsd
Compiling schema file:///home/cogan/repos/topst/bookend/data/spase-2.6.0.xsd
Builder: 319 main and 0 inner classes
Analyzer input: 319 main and 0 inner classes
Analyzer output: 122 main and 0 inner classes
Generating package: init
Generating package: spase_model.spase_2_6_0


Python model created in: spase_model


Processing XML files: 100%|███████████████████████| 1/1 [00:00<00:00, 37.91it/s]


<Graph identifier=N7f249075e1774fdc93545a3b534ef025 (<class 'rdflib.graph.Graph'>)>

In [21]:
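# Find the Instrument node. There should be exactly one. TODO: enforce this check
for s, p, o in graph.triples((None, a, ns1["Instrument"])):
    break
# The last segment of the URI, with underscores replaced by spaces, serves as the name
instrument_name = s.split("/")[-1].replace("_", " ")
print(instrument_name)

# Placeholder: the NumericalData nodes are located here but not used yet
for s, p, o in graph.triples((None, a, ns1["NumericalData"])):
    pass

# TODO: extract additional fields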

SMWG Instrument GOES 12 MAG


## Emit PIDINST Record

![kg-to-pidinst](./figures/kg-to-pidinst.png)
![spase-pidinst-mapping](./figures/spase-pidinst-mapping.png)

In [25]:
import xml.etree.ElementTree as ET

# Create the root element
root = ET.Element("instrument")
# Properties from the PIDINST schema
properties = ["identifier", "schemaVersion", "landingPage", "name", "owners", "manufacturers", "instrumentTypes", "description", "relatedIdentifiers", "alternateIdentifiers"]
# Mapping the values from the SPASE RDF (or other known values) to the PIDINST Schema
values = {
    "identifier": "",
    "schemaVersion": "1.0.0", # default value
    "landingPage": "",
    "name": instrument_name, 
    "description": ""
}

# For each of the known properties
for prop in properties:
    # Properties without a mapped value raise KeyError and are skipped
    try:
        if len(values[prop]) != 0: # detects empty collections and empty values
            child = ET.SubElement(root, prop)
            child.text = values[prop]

            if prop[-1] == "s": # A shortcut for detecting collections
                member = prop[:-1] # singular member name, e.g. "owners" -> "owner"
                member_node = ET.SubElement(child, member)
                # member_node.set("name", "value")
    except KeyError:
        continue

# Convert the tree to a string
xml_string = ET.tostring(root, encoding="unicode")

# Write the XML file
with open("./data/pidinst.xml", "w") as f:
    f.write(xml_string)
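
Before publishing the record, it can help to pretty-print the XML and parse it back as a basic sanity check. The sketch below uses a small stand-in for the tree built above, with placeholder values. `ET.indent` requires Python 3.9 or newer. This only checks well-formedness and field presence; it does not validate against the PIDINST schema.

```python
import xml.etree.ElementTree as ET

# Stand-in for the tree built above; the values are placeholders
root = ET.Element("instrument")
ET.SubElement(root, "schemaVersion").text = "1.0.0"
ET.SubElement(root, "name").text = "Example magnetometer"

# Indent in place for readability (requires Python 3.9 or newer)
ET.indent(root, space="  ")
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)

# Parse the string back and check that the fields survived
parsed = ET.fromstring(xml_string)
assert parsed.tag == "instrument"
assert parsed.findtext("name") == "Example magnetometer"
```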

## Emit RAID Record

## Emit ROR Record