# Welcome!
This notebook is a template for an end-to-end (or, perhaps more aptly, a discovery-analysis-publish) pipeline for working with Open NASA data. This version is heavily documented: it explains _what_ each step is supposed to do and _why_ we have done it that way.

### Bookend Structure
![bookend-structure](./figures/bookend-structure.png)

### Knowledge Graphs & Semantic Technologies
* [What is Metadata](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_Metadata/What_is_Metadata.md)
* [What is an Identifier](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_an_Identifier/What_is_an_Identifier.md)
* [What is a KG](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_a_Knowledge_Graph/What_is_a_Knowledge_Graph.md)
* [What is a Taxonomy](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_a_Taxonomy/What_is_a_Taxonomy.md)
* [What is an Ontology](https://github.com/KGConf/open-kg-curriculum/blob/master/curriculum/modules/What_is_an_Ontology/What_is_an_Ontology.md)

#### Ontology Design Patterns
Ontology Design Patterns (ODPs) are self-contained miniature ontologies that solve domain-invariant modeling problems. Our approach combines several of them to create a modular "plug and play" KG schema (or architecture).
* Computational Environment
* [Computational Observation](https://github.com/kastle-lab/computational-observation-pattern)
* [Data Transformation](https://github.com/Data-Semantics-Laboratory/data-transformation-pattern)

## Bookend Software
* [rdflib](https://rdflib.readthedocs.io/en/stable/)
* [sparqlwrapper](https://sparqlwrapper.readthedocs.io/en/latest/)

In [106]:
# rdflib is the general purpose python library for modifying a kg in memory and outputting it to a file
import rdflib
## Just some convenient classes to pull out
from rdflib import URIRef, Graph, Namespace, Literal
## namespaces are below. These are where identifiers "live", so to speak.
from rdflib import OWL, RDF, RDFS, XSD, TIME

# SPARQLWrapper is used to query a triplestore
## Import the class and the JSON constant directly; a separate `import SPARQLWrapper` would only be shadowed
from SPARQLWrapper import SPARQLWrapper, JSON

## Prefixes

In [107]:
# Some generally useful default prefixes for namespaces.
# Note: "time" is defined once, further down, using rdflib's built-in TIME namespace.
pfs = {
"geo": Namespace("http://www.opengis.net/ont/geosparql#"),
"geof": Namespace("http://www.opengis.net/def/function/geosparql/"),
"sf": Namespace("http://www.opengis.net/ont/sf#"),
"wd": Namespace("http://www.wikidata.org/entity/"),
"wdt": Namespace("http://www.wikidata.org/prop/direct/"),
"dbo": Namespace("http://dbpedia.org/ontology/"),
"ssn": Namespace("http://www.w3.org/ns/ssn/"),
"sosa": Namespace("http://www.w3.org/ns/sosa/"),
"cdt": Namespace("http://w3id.org/lindt/custom_datatypes#"),
"ex": Namespace("https://example.com/"),
"rdf": RDF,
"rdfs": RDFS,
"xsd": XSD,
"owl": OWL,
"time": TIME
}

# The namespace and prefixes which we will use for the metadata storage
name_space = "https://polyneme.xyz/"
pfs["polyr"] = Namespace(f"{name_space}lod/resource#")
pfs["poly-ont"] =  Namespace(f"{name_space}lod/ontology#")

## Storing Metadata
It should come as no surprise, given the rest of the notebook, that we store the metadata generated here in a knowledge graph. For now, it stays in memory as a `Graph` from `rdflib`. When we publish the dataset generated in this notebook, we will upload this graph into a graph database.

In [108]:
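def init_kg(prefixes=pfs):
    kg = Graph()
    # Bind every prefix that was passed in, so serialized output uses readable prefixed names
    for prefix, namespace in prefixes.items():
        kg.bind(prefix, namespace)
    return kg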
# rdf:type shortcut
a = pfs["rdf"]["type"]

# Initialize an empty graph
graph = init_kg()

## Accessing Your Local Graph Database
For this notebook, we assume you are running a `developer` (i.e., non-production) Apache Jena Fuseki triplestore as your graph database. The connection set up here is reused in several later cells.

In [109]:
db_loc = "http://localhost:3030" # default location of fuseki
dataset = "bookend"
endpoint = f"{db_loc}/{dataset}"
sparql = SPARQLWrapper(endpoint)
sparql.setReturnFormat(JSON)

# Construct the query
query = """
    PREFIX poly-ont: <https://polyneme.xyz/lod/ontology#>

    SELECT *
    WHERE {
        ?person a poly-ont:Person .
    }
    """
# Set the query
sparql.setQuery(query)

try:
    ret = sparql.queryAndConvert()
    print("Connection success.")
except Exception as e:
    print(e)

Connection success.


In [110]:
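## Helper used in later cells: checks whether a URI is already in the triplestore
def instanceExists(uri, sparql=sparql):
    """Return True if `uri` is the subject of at least one triple in the triplestore."""
    query = f"""
    SELECT *
    WHERE {{
        <{uri}> ?p ?o .
    }}
    LIMIT 1
    """
    sparql.setQuery(query)
    try:
        ret = sparql.queryAndConvert()
        return len(ret["results"]["bindings"]) == 1
    except Exception as e:
        # Report the failure and return an explicit False instead of an implicit None
        print(e)
        return False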
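
A SPARQL `ASK` query is another way to phrase the existence check that `instanceExists` performs, since it asks for a single boolean rather than result bindings. The sketch below only builds the query string. `build_ask_query` is a hypothetical helper, not part of this notebook or of SPARQLWrapper, and rejecting characters that are illegal inside `<...>` is an assumption about the kind of input you want to guard against.

```python
def build_ask_query(uri):
    """Build a SPARQL ASK query testing whether `uri` is the subject of any triple.

    Hypothetical helper, for illustration only.
    """
    uri = str(uri)
    # Characters the SPARQL grammar does not allow inside an IRI reference (<...>).
    # Rejecting them avoids emitting a malformed (or injected) query.
    forbidden = set('<>"{}|^`\\')
    if any(ch in forbidden or ord(ch) <= 0x20 for ch in uri):
        raise ValueError(f"not a valid IRI for use in a query: {uri!r}")
    return f"ASK {{ <{uri}> ?p ?o . }}"


query = build_ask_query("https://polyneme.xyz/lod/resource#polyneme.donny.home")
print(query)
```

In the SPARQL 1.1 JSON results format, the answer to an `ASK` query is carried under a top-level `boolean` key, so a caller would read that key instead of counting bindings.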

## Capturing the Current Computational Environment
![computational environment](./figures/computational-environment-pattern.png)
This pattern captures the environment in which you transform data (i.e., create something new from something old). Recording it is useful for replicability.
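
As a rough illustration of what "capturing the environment" could mean in practice, the sketch below gathers a few facts about the current machine and interpreter using only the Python standard library. The choice of fields and the name `describe_environment` are assumptions made for this example; the pattern's actual properties are not listed in this notebook, so nothing here is attached to the graph.

```python
import platform
import sys


def describe_environment():
    """Collect a few facts about the machine and interpreter running this code.

    Hypothetical helper; the fields are illustrative, not prescribed by the pattern.
    """
    return {
        "hostname": platform.node(),
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "executable": sys.executable,
    }


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Values like these could later be stored as literals on the computational environment's URI, once the corresponding properties have been decided.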

In [111]:
# Code to populate this pattern for this notebook

## Mint a URI for this computational environment
### There are many ways to create an identifier
### We have chosen a way that encodes some information for identifiability, without searching for the label.
comp_env_name = "polyneme.donny.home"
#### Some other examples might be
#### "polyneme.cogan.wsu"
#### "organization.name.location"
comp_env_uri = pfs["polyr"][comp_env_name]
## Check to see if the computational environment exists
if instanceExists(comp_env_uri):
    ## You have done this before (i.e., this is not your first time running this notebook).
    ## Check whether the recorded environment needs updates; if it hasn't changed, move on.
    pass
else:
    ## First run for this computational environment: declare it in the graph.
    graph.add( (comp_env_uri, a, pfs["poly-ont"]["ComputationalEnvironment"]) )
    ### TODO read comp_env from config

## Computational Observations
![simulation activity](./figures/computational-observation-pattern.jpg)

In [112]:
# Code to populate this pattern for this notebook
pass

## Dataset Discovery
This is effectively the "Extract" part of the ETL loop.

In [113]:
# Code to do data set discovery
## PySat?
## HDPE.io
## CDAWeb
pass

# It is important to keep a record of which datasets we start with
## Populated with sample dataset names
data_inputs = ["dataset1", "dataset2"]

## This is where your code goes!
This is effectively the "Transform" part of the ETL loop.

In [114]:
pass

In [115]:
# TODO support multiple output datasets from a single script
# It is important to provide the name and location of where the data exists
# Optionally, you can give the format (e.g., csv)
result_name = "example.data"
result_loc  = "https://example.org/sample/data/loc"
result_format = "csv"
## Uniqueness is necessary!
output_uri = pfs["polyr"][f"Data.{result_name}"]
if instanceExists(output_uri):
    print("You have chosen a result name which exists in the graph database already.")
    print("If you continue, you will accidentally merge any metadata for this with any previously committed version.")
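
If a collision is detected, one option is to derive a fresh name automatically instead of only warning. The sketch below is a minimal example of that idea. `first_free_name` and the `.vN` suffix scheme are hypothetical, and a plain set stands in for the triplestore so that the example runs without a database.

```python
def first_free_name(base_name, exists, max_tries=1000):
    """Return `base_name`, or `base_name.vN` for the smallest N >= 2 that is not taken.

    `exists` is any callable that takes a candidate name and returns True if it is taken.
    Hypothetical helper, for illustration only.
    """
    if not exists(base_name):
        return base_name
    for version in range(2, max_tries + 2):
        candidate = f"{base_name}.v{version}"
        if not exists(candidate):
            return candidate
    raise RuntimeError(f"no free name found for {base_name!r} after {max_tries} tries")


# Stand-in for the triplestore: a set of names that are already taken.
taken = {"example.data", "example.data.v2"}
print(first_free_name("example.data", taken.__contains__))
```

Against the real graph database, the callable could wrap the existing check, for example `lambda name: instanceExists(pfs["polyr"][f"Data.{name}"])`.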

## Dataset Publishing (Internal)
Now it is time to publish your work. This is effectively the "Load" part of the ETL loop.

### Data Transformation Pattern
![data transformation pattern](./figures/data-transformation-pattern.jpg)

In [116]:
# Modeling
## Output Data
output_uri = pfs["polyr"][f"Data.{result_name}"]
graph.add((output_uri, a, pfs["poly-ont"]["Data"])) # declaration
### Associate the dataset with its actual data (i.e., the payload)
#### This reification exists in case we want to have other metadata about the location link (e.g., access permissions)
payload_uri = pfs["polyr"][f"Payload.{result_name}"]
graph.add((payload_uri, a, pfs["poly-ont"]["Payload"])) # declaration
loc_uri = URIRef(result_loc)
graph.add((output_uri, pfs["poly-ont"]["hasPayload"], payload_uri)) # association
graph.add((payload_uri, pfs["poly-ont"]["hasLocation"], loc_uri)) # association

### Associate the datatype to the dataset
format_uri = pfs["polyr"][f"DataType.{result_format}"]
graph.add( (output_uri, pfs["poly-ont"]["hasDataType"], format_uri) )

## DataTransformation

### TODO global notion of uniqueness for dt
dt_id = 1
dt_uri = pfs["polyr"][f"DataTransformation.{dt_id}"]
graph.add( (dt_uri, a, pfs["poly-ont"]["DataTransformation"]) )
graph.add( (dt_uri, pfs["poly-ont"]["occursInCE"], comp_env_uri))
### TODO spatiotemporal extent (or at least temporal extent)
### TODO association of the DataTransformation with ComputationalModelExecution

## Input Data & Data Roles
for data_input in data_inputs:
    input_uri = pfs["polyr"][f"Data.{data_input}"]
    graph.add((input_uri, a, pfs["poly-ont"]["Data"])) # declaration
    
    ## Mint a new input data role
    input_role_uri = pfs["polyr"][f"InputRole.DT{dt_id}.{data_input}"]
    graph.add((input_role_uri, a, pfs["poly-ont"]["InputRole"] )) # declaration
    graph.add((input_uri, pfs["poly-ont"]["performsInputRole"], input_role_uri)) # association
    graph.add((dt_uri, pfs["poly-ont"]["providesInputRole"], input_role_uri)) # association

## Output Data Role
output_role_uri = pfs["polyr"][f"OutputRole.DT{dt_id}.{result_name}"]
graph.add((output_role_uri, a, pfs["poly-ont"]["OutputRole"] )) # declaration
graph.add((output_uri, pfs["poly-ont"]["performsOutputRole"], output_role_uri)) # association
graph.add((dt_uri, pfs["poly-ont"]["providesOutputRole"], output_role_uri)) # association
pass # prevents cell output
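
The cell above uses `dt_id = 1` as a placeholder and leaves uniqueness as a TODO. One possible approach, sketched here under assumptions, is to derive the identifier from a hash of what the transformation consumed, what it produced, and when it started. The function name, the choice of fields, the 16-character truncation, and the sample timestamp are all illustrative and not part of the notebook's design.

```python
import hashlib
import json


def mint_transformation_id(input_names, output_name, started_at):
    """Derive a short, repeatable identifier for one data transformation run.

    Hypothetical scheme: hash the sorted input names, the output name, and the
    start time of the run (given as a string).
    """
    record = {
        "inputs": sorted(input_names),  # sorting makes the id independent of input order
        "output": output_name,
        "started_at": started_at,
    }
    # A canonical JSON encoding keeps the hash stable across runs and machines.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# The timestamp below is a made-up sample value.
dt_id = mint_transformation_id(["dataset1", "dataset2"], "example.data", "2024-01-01T00:00:00Z")
print(f"DataTransformation.{dt_id}")
```

Truncating the digest keeps the URI readable, at the cost of making collisions unlikely rather than impossible, so a check such as `instanceExists` may still be worthwhile before committing.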

In [119]:
output_file = "output.ttl"
# Serialize the in-memory graph (named `graph` above) to a Turtle file
temp = graph.serialize(format="turtle", encoding="utf-8", destination=output_file)

In [118]:
# Push to graph database
pass
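
One way to fill in this last step is to send the serialized Turtle file to the triplestore over the SPARQL 1.1 Graph Store HTTP Protocol. The sketch below uses only the standard library and makes several assumptions: that the dataset exposes a Graph Store endpoint at `<dataset>/data` (Fuseki's usual service name, but check your configuration), that no authentication is required, and that the triples should go into the default graph. Both function names are hypothetical.

```python
import urllib.request


def build_upload_request(turtle_bytes, dataset_url):
    """Build (but do not send) a Graph Store Protocol POST that adds triples to the default graph."""
    return urllib.request.Request(
        url=f"{dataset_url}/data?default",  # assumed endpoint layout
        data=turtle_bytes,
        headers={"Content-Type": "text/turtle; charset=utf-8"},
        method="POST",
    )


def push_turtle_file(path, dataset_url):
    """Send the Turtle file at `path` to the triplestore and return the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_upload_request(handle.read(), dataset_url)
    with urllib.request.urlopen(request) as response:
        return response.status


# Building the request needs no running server, so it can be inspected offline.
sample = b"<https://example.com/s> <https://example.com/p> <https://example.com/o> .\n"
request = build_upload_request(sample, "http://localhost:3030/bookend")
print(request.get_method(), request.full_url)
```

With a running triplestore, `push_turtle_file("output.ttl", endpoint)` would perform the actual upload. `POST` adds to the existing graph, which matches the merge behavior the earlier uniqueness warning describes.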