# Welcome!
This notebook is intended as a template for an end-to-end (or, perhaps more aptly, a discovery-analysis-publish) pipeline for using Open NASA data. This version contains much more documentation and tutorial material, as we explain _what_ is supposed to be happening and _why_ we have done these things.

## Bookend
Bookend is a sort-of template Jupyter notebook whose purpose is to demonstrate the usefulness of semantic technologies for accelerating research and facilitating [FAIR](https://www.go-fair.org/fair-principles/) and open science.

This takes the form of "bookends" around a normal analytical or data transformation notebook. That is, we start with an initial cell that uses semantic technologies to help discover data. This is done by searching through a _lifted_ SPASE record database. In the future, we want to support local repositories, where you can semantically describe (and thus semantically find) your own work to create reproducible scientific workflows.

Then, we take a snapshot of your computational environment. For the moment, this is a semi-automated mechanism. It is further simplified because we expect the vast majority of these "bookends" to be executed in a [HelioCloud](https://heliocloud.org/) instance. This is one way we, as data scientists, can preserve experimental context for reproducibility and replicability.

The next step is to tie together your computational models and your computational environment.

With that, you then insert your own cells.

Finally, we publish, even if only locally, what has been produced. There are a few data fields for you, the developer, to fill out. We hope it is not too onerous.

Graphically, `bookend` looks like the following:
![bookend-structure](./figures/bookend-structure.png)

### Knowledge Graphs & Semantic Technologies
Briefly, we will describe all of this metadata (i.e., data about data) in a form called a knowledge graph (KG). You can learn more about this from the following resources.

* [What is Metadata](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_Metadata/What_is_Metadata.md) -- briefly, data about data.
* [What is an Identifier](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_an_Identifier/What_is_an_Identifier.md) -- briefly, identifiers are unique IRIs for a piece of data in a particular namespace.
* [What is a KG](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_a_Knowledge_Graph/What_is_a_Knowledge_Graph.md) -- briefly, a KG is a way of relating pieces of data (nodes) through relations (edges) in a way that both humans _and_ machines can understand. KGs tend to be organized according to a schema, which is frequently an ontology.
* [What is a Taxonomy](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_a_Taxonomy/What_is_a_Taxonomy.md) -- briefly, a taxonomy is a way of hierarchically organizing a set of terms. 
* [What is an Ontology](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_an_Ontology/What_is_an_Ontology.md) -- briefly, ontologies are ways of dictating how different data can be related: for example, the child of a person should always be a person.
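
To make the ideas above concrete, here is a minimal sketch that reduces a knowledge graph to a set of (subject, predicate, object) triples in plain Python. It is only an illustration: the `alice` and `bob` identifiers, the `hasChild` relation, and the `match` and `children_are_persons` helpers are hypothetical and are not part of `bookend` or `rdflib`. The last helper mimics the kind of rule an ontology might state (the child of a person should be a person).

```python
# A knowledge graph reduced to its essentials: a set of (subject, predicate, object) triples.
# All identifiers below are hypothetical examples in the https://example.com/ namespace.
EX = "https://example.com/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = {
    (EX + "alice", RDF_TYPE, EX + "Person"),
    (EX + "bob", RDF_TYPE, EX + "Person"),
    (EX + "alice", EX + "hasChild", EX + "bob"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in sorted(graph)
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def children_are_persons(graph):
    """Check an ontology-style rule: every object of hasChild is typed as a Person."""
    persons = {s for s, _, _ in match(graph, p=RDF_TYPE, o=EX + "Person")}
    return all(child in persons for _, _, child in match(graph, p=EX + "hasChild"))


print(match(triples, s=EX + "alice", p=EX + "hasChild"))
print(children_are_persons(triples))
```

In the rest of the notebook, `rdflib` plays the role of the `triples` set and SPARQL plays the role of `match`.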

#### Ontology Design Patterns
To design the KG for `bookend` we make use of a set of templates for organizing the _schema_ of the KG. We call these Ontology Design Patterns (ODPs). ODPs are self-contained miniature ontologies that solve domain-invariant modeling problems. Our approach uses several to create a modular "plug and play" KG schema (or architecture).
* Computational Environment
* [Computational Observation](https://github.com/kastle-lab/computational-observation-pattern)
* [Data Transformation](https://github.com/Data-Semantics-Laboratory/data-transformation-pattern)

## Bookend Requirements
* [rdflib](https://rdflib.readthedocs.io/en/stable/)
* [sparqlwrapper](https://sparqlwrapper.readthedocs.io/en/latest/)

## The BookBEGINNING

In [None]:
# rdflib is the general purpose python library for modifying a kg in memory and outputting it to a file
import rdflib
## Just some convenient classes to pull out
from rdflib import URIRef, Graph, Namespace, Literal
## namespaces are below. These are where identifiers "live", so to speak.
from rdflib import OWL, RDF, RDFS, XSD, TIME

# sparqlwrapper is used to query a triplestore
# (importing the class directly; a separate `import SPARQLWrapper` would be shadowed by this line)
from SPARQLWrapper import SPARQLWrapper, JSON

## Prefixes

In [None]:
# Some default prefixes for namespaces that are generally useful.
pfs = {
    "geo": Namespace("http://www.opengis.net/ont/geosparql#"),
    "geof": Namespace("http://www.opengis.net/def/function/geosparql/"),
    "sf": Namespace("http://www.opengis.net/ont/sf#"),
    "wd": Namespace("http://www.wikidata.org/entity/"),
    "wdt": Namespace("http://www.wikidata.org/prop/direct/"),
    "dbo": Namespace("http://dbpedia.org/ontology/"),
    "ssn": Namespace("http://www.w3.org/ns/ssn/"),
    "sosa": Namespace("http://www.w3.org/ns/sosa/"),
    "cdt": Namespace("http://w3id.org/lindt/custom_datatypes#"),
    "ex": Namespace("https://example.com/"),
    "rdf": RDF,
    "rdfs": RDFS,
    "xsd": XSD,
    "owl": OWL,
    # "time" was listed twice; rdflib's built-in TIME covers http://www.w3.org/2006/time#
    "time": TIME
}

# The namespace and prefixes which we will use for the metadata storage
name_space = "https://polyneme.xyz/"
pfs["polyr"] = Namespace(f"{name_space}lod/resource#")
pfs["poly-ont"] =  Namespace(f"{name_space}lod/ontology#")

## Storing Metadata
Given the rest of the notebook, it should come as no surprise that we store the metadata generated here in a knowledge graph. For now, it stays in memory as a `Graph` from `rdflib`. When we publish the dataset generated in this notebook, we will upload the dataset into a graph database.

In [None]:
def init_kg(prefixes=pfs):
    # Create an empty graph and bind each prefix to its namespace
    kg = Graph()
    for prefix, namespace in prefixes.items():
        kg.bind(prefix, namespace)
    return kg
# rdf:type shortcut
a = pfs["rdf"]["type"]

# Initialize an empty graph
graph = init_kg()

## Accessing Your Local Graph Database
For this notebook, we assume you are running a `developer` (i.e., non-production) Apache Jena Fuseki triplestore as your graph database. This will be useful in several different cells.

In [None]:
db_loc = "http://localhost:3030"
dataset = "bookend"
endpoint_url = f"{db_loc}/{dataset}"
endpoint = SPARQLWrapper(endpoint_url) # default location of fuseki
endpoint.setReturnFormat(JSON)

# Construct the query
query = """
    PREFIX poly-ont: <https://polyneme.xyz/lod/ontology#>

    SELECT *
    WHERE {
        ?person a poly-ont:Person .
    }
    """
# Set the query
endpoint.setQuery(query)

try:
    ret = endpoint.queryAndConvert()
    print("Connection success.")
except Exception as e:
    print(e)
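
The cells in this notebook read query results through `ret["results"]["bindings"]`. As a sketch of what that structure looks like, the block below uses a hand-written sample in the standard SPARQL JSON results layout, so it runs without a triplestore. The sample values and the `rows` helper are illustrative assumptions, not output from a real endpoint.

```python
# A hand-written sample in the SPARQL 1.1 JSON results layout (not real endpoint output).
sample_result = {
    "head": {"vars": ["p", "o"]},
    "results": {
        "bindings": [
            {
                "p": {"type": "uri", "value": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"},
                "o": {"type": "uri", "value": "https://polyneme.xyz/lod/ontology#ComputationalEnvironment"},
            }
        ]
    },
}


def rows(result):
    """Flatten each binding into a plain {variable: value} dict.

    Variables that are unbound in a given solution are simply omitted.
    """
    variables = result["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in result["results"]["bindings"]
    ]


for row in rows(sample_result):
    print(row["p"], "->", row["o"])
```

An empty `bindings` list means the query matched nothing, which is different from the query failing with an exception.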

## Local SPASE
The current expectation is that an RDF-ified version of the SPASE dataset is available locally. For more information on generating this, see [spase-rdf-tools](https://github.com/polyneme/spase-rdf-tools/tree/master/spase_rdf_tools).

In [None]:
spase_dataset = "spase-rdf"
spase_endpoint_url = f"{db_loc}/{spase_dataset}"

spase_endpoint = SPARQLWrapper(spase_endpoint_url) # default location of fuseki
spase_endpoint.setReturnFormat(JSON)

# Construct the query
query = """
    PREFIX poly-ont: <https://polyneme.xyz/lod/ontology#>

    SELECT *
    WHERE {
        ?person a poly-ont:Person .
    }
    """
# Set the query
spase_endpoint.setQuery(query)

try:
    ret = spase_endpoint.queryAndConvert()
    print("Connection success.")
except Exception as e:
    print(e)

## Utility Functions

In [None]:
## This query will be useful
def instanceExists(uri, sparql=endpoint):
    query = f"""
    PREFIX poly-ont: <https://polyneme.xyz/lod/ontology#>

    SELECT *
    WHERE {{
        <{uri}> ?p ?o .
    }}
    LIMIT 1
    """
    sparql.setQuery(query)
    try:
        ret = sparql.queryAndConvert()
        return len(ret["results"]["bindings"]) == 1
    except Exception as e:
        # Report the failure and treat the instance as not found
        print(e)
        return False

## Capturing the Current Computational Environment
![computational environment](./figures/computational-environment-pattern.png)
The purpose of this is to capture the environment in which you transform data (i.e., create something new from something old). This is useful for replicability.

In [None]:
# Code to populate this pattern for this notebook

## Mint a URI for this computational environment
### There are many ways to create an identifier
### We have chosen a way that encodes some information for identifiability, without searching for the label.
comp_env_name = "polyneme.donny.home"
#### Some other examples might be
#### "wright.cogan.campus"
#### "organization.name.location"
comp_env_uri = pfs["polyr"][comp_env_name]
## Check to see if the computational environment exists
if instanceExists(comp_env_uri):
    ## The environment is already recorded (i.e., this is not your first time running this notebook).
    ## TODO check whether its description needs updating; otherwise, move on.
    pass
else:
    ## This computational environment has not been seen before, so declare it in the in-memory graph.
    graph.add( (comp_env_uri, a, pfs["poly-ont"]["ComputationalEnvironment"]) )
    ### TODO read comp_env from config
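
Until the environment is read from a config file, part of the snapshot can be gathered automatically with the standard library. The following is a minimal sketch under that assumption. The `snapshot_environment` helper and the keys of the dictionary it returns are hypothetical, and mapping them onto properties of the Computational Environment pattern is left open.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages=("rdflib", "SPARQLWrapper")):
    """Collect a small, JSON-serializable description of the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": versions,
    }


env = snapshot_environment()
print(json.dumps(env, indent=2))
```

Each value could later be attached to `comp_env_uri` as a literal, once the ontology names the corresponding properties.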

## Dataset Discovery
This is effectively the "Extract" part of the ETL loop.

In [24]:
# This code is currently set to query for data products (spase:NumericalData) by keyword.
keyword = "GOES"
query = f"""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX spase: <http://www.spase-group.org/data/schema/>

SELECT ?label ?uri WHERE {{
  ?uri a spase:NumericalData ;
     spase:keyword "{keyword}" ;
     rdfs:label ?label .
}} LIMIT 100
"""

spase_endpoint.setQuery(query)
try:
    ret = spase_endpoint.queryAndConvert()
    for res in ret["results"]["bindings"]:
        print(f'{res["label"]["value"]}: {res["uri"]["value"]}')
except Exception as e:
    print(e)

NOAA_NumericalData_GOES_13_EPS_EPEAD_E13EW_PT1M: http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_13_EPS_EPEAD_E13EW_PT1M
NOAA_NumericalData_GOES_15_EPS_MAGPD_19MP4_PT32S: http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_15_EPS_MAGPD_19MP4_PT32S
NOAA_NumericalData_GOES_13_EPS_EPEAD_A16W_PT32S: http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_13_EPS_EPEAD_A16W_PT32S
NOAA_NumericalData_GOES_14_EPS_EPEAD_A16W_PT32S: http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_14_EPS_EPEAD_A16W_PT32S
NOAA_NumericalData_GOES_13_EPS_MAGED_19ME5_PT32S: http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_13_EPS_MAGED_19ME5_PT32S
NOAA_NumericalData_GOES_13_EPS_EPEAD_CPFLUX_PT5M: http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_13_EPS_EPEAD_CPFLUX_PT5M
NOAA_NumericalData_GOES_15_EPS_MAGPD_19MP15_PT5M: http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_15_EPS_MAGPD_19MP15_PT5M
NOAA_NumericalData_GOES_6_SEM_A_PT5M: ht
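
The discovery query interpolates `keyword` directly into the SPARQL text, so a keyword containing a double quote or a backslash would produce an invalid query. A minimal sketch of one way to guard against that is shown below. The `sparql_string_literal` helper is hypothetical (it is not part of `SPARQLWrapper`) and handles only the common escape sequences of SPARQL string literals.

```python
def sparql_string_literal(text):
    """Return text as a double-quoted SPARQL string literal with common escapes applied."""
    escaped = (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return f'"{escaped}"'


# Hypothetical keyword containing quotes
keyword = 'GOES "13"'
triple_pattern = f"?uri spase:keyword {sparql_string_literal(keyword)} ."
print(triple_pattern)
```

The returned value already includes the surrounding quotes, so it is placed into the query without adding more.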

In [28]:
# Now, choose one of those (label, URI) pairs
dataset_label = "NOAA_NumericalData_GOES_13_EPS_EPEAD_E13EW_PT1M"
dataset_uri = "http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_13_EPS_EPEAD_E13EW_PT1M"

query = f"""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX spase: <http://www.spase-group.org/data/schema/>

SELECT ?actual_url WHERE {{
 <{dataset_uri}> spase:has_access_information ?o .
  ?o spase:has_access_url ?url .
  ?url spase:name "HTTPS from SPDF" ;
       spase:url ?actual_url .
}} LIMIT 100
"""

spase_endpoint.setQuery(query)
try:
    ret = spase_endpoint.queryAndConvert()
    print("You can get your data from here. In the future, we will attempt to get this for you.")
    for res in ret["results"]["bindings"]:
        print(f'{res["actual_url"]["value"]}')
except Exception as e:
    print(e)

# It is important to note that we have a record of which datasets we start with
## Populated sample dataset names
data_inputs = [dataset_uri]

You can get your data from here. In the future, we will attempt to get this for you.
https://spdf.gsfc.nasa.gov/pub/data/goes/goes13/epead-electrons/e13ew_1min/


## This is where your code goes!
This is effectively the "Transform" part of the ETL loop.

In [30]:
## Import Stuff
pass

In [29]:
## Do Stuff
pass

## The BookEND
The following are data fields that `bookend` needs from you in order to encode your results and publish them to the local repository.

In [None]:
# TODO support multiple output datasets from a single script
# It is important to provide the name and location of where the data exists
# Optionally, you can give the format (e.g., csv)
result_name = "example.data"
result_loc  = "https://example.org/sample/data/loc"
result_format = "csv"
## Uniqueness is necessary!
output_uri = pfs["polyr"][f"Data.{result_name}"]
if instanceExists(output_uri):
    print("You have chosen a result name which exists in the graph database already.")
    print("If you continue, you will accidentally merge any metadata for this with any previously committed version.")

Next, provide a short, natural-language description of what your script does.

In [None]:
transformation_description = """
text goes here!
"""

## Dataset Publishing (Internal)
Now it is time to publish your work. This is effectively the "Load" part of the ETL loop.

### Computational Lineage
This is the result of combining the two ODPs (`Computational Observation` and `Data Transformation`) into something that better fits our use-case.

![computational-lineage](./figures/computational-lineage-pattern.png)

In [None]:
# Modeling
## Output Data
output_uri = pfs["polyr"][f"Data.{result_name}"]
graph.add((output_uri, a, pfs["poly-ont"]["Data"])) # declaration
### Associate the dataset with its actual data (i.e., the payload)
#### This reification exists in case we want to have other metadata about the location link (e.g., access permissions)
payload_uri = pfs["polyr"][f"Payload.{result_name}"]
graph.add((payload_uri, a, pfs["poly-ont"]["Payload"])) # declaration
loc_uri = URIRef(result_loc)
graph.add((output_uri, pfs["poly-ont"]["hasPayload"], payload_uri)) # association
graph.add((payload_uri, pfs["poly-ont"]["hasLocation"], loc_uri)) # association

### Associate the datatype to the dataset
format_uri = pfs["polyr"][f"DataType.{result_format}"]
graph.add( (output_uri, pfs["poly-ont"]["hasDataType"], format_uri) )

## DataTransformation

### TODO global notion of uniqueness for dt
dt_id = 1
dt_uri = pfs["polyr"][f"DataTransformation.{dt_id}"]
graph.add( (dt_uri, a, pfs["poly-ont"]["DataTransformation"]) )
graph.add( (dt_uri, pfs["poly-ont"]["occursInCE"], comp_env_uri))
### TODO spatiotemporal extent (or at least temporal extent)
### TODO association of the DataTransformation with ComputationalModelExecution

## Input Data & Data Roles
for data_input in data_inputs:
    input_uri = pfs["polyr"][f"Data.{data_input}"]
    graph.add((input_uri, a, pfs["poly-ont"]["Data"])) # declaration
    
    ## Mint a new input data role
    input_role_uri = pfs["polyr"][f"InputRole.DT{dt_id}.{data_input}"]
    graph.add((input_role_uri, a, pfs["poly-ont"]["InputRole"] )) # declaration
    graph.add((input_uri, pfs["poly-ont"]["performsInputRole"], input_role_uri)) # association
    graph.add((dt_uri, pfs["poly-ont"]["providesInputRole"], input_role_uri)) # association

## Output Data Role
output_role_uri = pfs["polyr"][f"OutputRole.DT{dt_id}.{result_name}"]
graph.add((output_role_uri, a, pfs["poly-ont"]["OutputRole"] )) # declaration
graph.add((output_uri, pfs["poly-ont"]["performsOutputRole"], output_role_uri)) # association
graph.add((dt_uri, pfs["poly-ont"]["providesOutputRole"], output_role_uri)) # association
pass # prevents cell output
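
The modeling cell hard-codes `dt_id = 1` and leaves uniqueness as a TODO. One possible approach, sketched here as an assumption rather than as the design `bookend` will adopt, is to derive a deterministic identifier from the inputs, the output name, and the description using `uuid.uuid5`. The `mint_dt_id` helper and the `BOOKEND_NAMESPACE` constant are hypothetical names.

```python
import uuid

# Hypothetical UUID namespace derived from the notebook's base IRI
BOOKEND_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://polyneme.xyz/")


def mint_dt_id(input_uris, output_name, description=""):
    """Derive a stable identifier for a data transformation.

    The same inputs, output name, and description always give the same ID,
    regardless of the order in which the inputs are listed.
    """
    key = "|".join(sorted(input_uris) + [output_name, description.strip()])
    return str(uuid.uuid5(BOOKEND_NAMESPACE, key))


example_id = mint_dt_id(
    ["http://www.spase-group.org/data/schema/NOAA_NumericalData_GOES_13_EPS_EPEAD_E13EW_PT1M"],
    "example.data",
    "text goes here!",
)
print(example_id)
```

A deterministic ID means re-running an unchanged notebook describes the same transformation node. If every run should count as a distinct transformation, `uuid.uuid4()` would be the simpler choice.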

In [None]:
output_file = "output.ttl"
temp = graph.serialize(format="turtle", encoding="utf-8", destination=output_file)

In [None]:
# Push to graph database
pass
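
The push step is still a placeholder. As a sketch of one way it could work, the block below builds an HTTP request that follows the SPARQL Graph Store Protocol, assuming the Fuseki dataset exposes its graph store service under the default name `data`. The `build_upload_request` helper is hypothetical, and the request is only constructed, not sent, so the sketch runs without a server.

```python
import urllib.request


def build_upload_request(turtle_bytes, db_loc="http://localhost:3030", dataset="bookend"):
    """Build a POST request that would add Turtle triples to the dataset's default graph.

    Assumes the graph store service is reachable at {db_loc}/{dataset}/data.
    """
    url = f"{db_loc}/{dataset}/data?default"
    return urllib.request.Request(
        url,
        data=turtle_bytes,
        method="POST",
        headers={"Content-Type": "text/turtle; charset=utf-8"},
    )


# Hypothetical payload; in the notebook this would be the contents of output.ttl
example_turtle = b'<https://example.com/s> <https://example.com/p> "o" .\n'
request = build_upload_request(example_turtle)
print(request.get_method(), request.full_url)

# Sending is left commented out so that nothing is written by accident:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Using `POST` merges the new triples into the existing graph, which matches the merge warning printed earlier when a result name already exists.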