## Guest Lecture COMP7230
# Spatio-temporal data manipulation in Semantic Graphs
#### by Dr Nicholas Car

This Notebook is the resource used to deliver a guest lecture for the [Australian National University](https://www.anu.edu.au)'s course [COMP7230](https://programsandcourses.anu.edu.au/2020/course/COMP7230): *Introduction to Programming for Data Scientists*. It is the second lecture in that series.

Click here to run this lecture in your web browser:
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nicholascar/comp7230-training/HEAD?filepath=lecture_02.ipynb)

## 0. Lecture Outline
1. Notes about this training material
2. Spatio-Temporal data use
3. Multi-dimensional data
4. S-T typology
5. Semantic modelling
6. Using semantic S-T data

The goal of this lecture is to convey some sense of the power of semantic Web data for data science.

## 1. Notes about this training material

* see [lecture_01.ipynb](lecture_01.ipynb) for notes on how to use this material

## 2. Spatio-Temporal data use

> _"everything is somewhere..."_

_everything is some-when_ also?

### 2.1 Spatial:
* lots of familiar tools: some free
    * Google Maps / OpenStreetMap for simple web display
    * pandas + maps for data sci
    * ArcGIS / QGIS for desktop spatial data manipulation
    * Postgres + PostGIS for spatial DB work

* lots of data formats
    * geometry serialisations (GeoJSON, WKT etc)
    * spatial data files (Shapefiles, File GeoDatabases, CSV!)
    * databases (Oracle, Postgres, even NoSQL, like Mongo + GeoJSON)

### 2.2 Temporal:
Most software and database systems deal with temporality, at least to some extent.

### 2.3 S-T+
* spatial/temporal data is almost always linked to non-spatial data for information
* typical workflows involve separate spatial and non-spatial operations
    * perhaps subsetting a dataset by area (spatial)
    * perhaps then by time (temporal)
    * then aggregating/transforming resultset into a final form (non-S-T)

### 2.4 Example scenario: in QGIS

Q: _find the average number of people, per dwelling, per suburb, in August 2021_

Data might be presented as _people per dwelling per 'Mesh Block' for 2011 - 2021_

Workflow might be:

1. Filter out non-August 2021 data (temporal)
2. Average people per dwelling per Mesh Block (non-spatial)
3. Intersect Mesh Blocks with suburbs ("LGAs") (spatial)

These steps are necessarily sequential if we can't represent all the dimensions of our data in one system. Even if we can, we need specialised functions for some dimensions: statistical (_average_), spatial (_within_, intersections) and temporal (also _within_).
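The three sequential steps can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the records, the `mb_to_suburb` lookup (standing in for the result of a spatial intersection) and all values are hypothetical, and the temporal filter works on whole years only.

```python
from collections import defaultdict
from statistics import mean

# hypothetical records: (dwelling ID, people, Mesh Block, census year)
dwellings = [
    (1, 2, "MB-A", 2016),
    (2, 4, "MB-A", 2021),
    (3, 2, "MB-A", 2021),
    (4, 5, "MB-B", 2021),
    (5, 1, "MB-C", 2021),
]

# hypothetical result of intersecting Mesh Blocks with suburbs
mb_to_suburb = {"MB-A": "Suburb-1", "MB-B": "Suburb-1", "MB-C": "Suburb-2"}

# step 1 (temporal): keep only the 2021 records
recent = [d for d in dwellings if d[3] == 2021]

# step 2 (non-spatial): average people per dwelling per Mesh Block
people_by_mb = defaultdict(list)
for _, people, mb, _ in recent:
    people_by_mb[mb].append(people)
avg_by_mb = {mb: mean(p) for mb, p in people_by_mb.items()}

# step 3 (spatial): roll the dwellings up to suburbs via the lookup
people_by_suburb = defaultdict(list)
for _, people, mb, _ in recent:
    people_by_suburb[mb_to_suburb[mb]].append(people)
avg_by_suburb = {s: mean(p) for s, p in people_by_suburb.items()}

print(avg_by_mb)
print(avg_by_suburb)
```

Note that step 3 averages over dwellings, not over the Mesh Block averages from step 2: averaging averages would weight a Mesh Block with one dwelling the same as one with many.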

How would this scenario work with QGIS?

![](img/qgis-screenshot.png)
**Figure 1**: Screenshot of the QGIS tool showing ACT

In QGIS, we have lots of spatial operations, so can do intersections, but non-spatial operations aren't easily catered for. Typically, users export data to PANDAS, Tableau etc for more processing and pre-filter data temporally.

## 3. Multi-dimensional data

### 3.1 Intro

* before considering how to improve our operation, let's consider our previous example's dimensions

![](img/data-dimensions.png)  
**Figure 2**: Dimensions of the data in the _Example Scenario_ above

* To perform any operations on these dimensions, S-T or other, we need to know about how those dimensions "work": their typology
* If we know this, we can characterise functions per dimension
* The more dimensions we can cater for in one place, the fewer sequential steps or systems people will need to use to get the results they want

### 3.2 Example scenario: in a relational DB

* Relational Databases easily cater for multi-dimensional data
* Python contains code to work with the SQLite DB "out of the box"
* SQLite is a relational DB in a single file
    * likely the "most implemented" DB in the world: every Android phone
    * don't be misled by the single file: it can scale to billions of entries...
    
Data for our scenario in RDB-ready form:

In [None]:
print(open("lecture_02_dwellings.csv").read())

In [None]:
# use standard Python libraries
import csv
import sqlite3

# create an SQLite DB
conn = sqlite3.connect('test.db')
print("Opened database successfully")

# create a table within it
conn.execute("DROP TABLE IF EXISTS dwellings;")  # good practice to remove preexisting
conn.execute(
    """
    CREATE TABLE dwellings
    (ID INT PRIMARY KEY    NOT NULL,
    NoPeople        INT    NOT NULL,
    ContainingMB    INT   NOT NULL,
    CensusYear      INT    NOT NULL);
    """
)
print("Table created successfully")

# check it's empty
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM dwellings;")
rows = cur.fetchall()
print(f"Rows in table: {rows[0][0]}")

In [None]:
# read CSV data, insert it into table
with open("lecture_02_dwellings.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for field in reader:
        conn.execute("INSERT INTO dwellings VALUES (?,?,?,?);", field)
print("Read CSV data into table")

# check there are 30 entries in table
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM dwellings;")
rows = cur.fetchall()
print(f"Rows in table: {rows[0][0]}")

In [None]:
# select only 2021 data, aggregate by containing Mesh Block
cur = conn.cursor()
cur.execute(
    """
    SELECT ContainingMB, AVG(NoPeople) 
    FROM dwellings 
    WHERE CensusYear = 2021
    GROUP BY ContainingMB
    """
)
rows = cur.fetchall()
print("Average number of people per dwelling per Mesh Block in 2021:")
for row in rows:
    print(row)

In [None]:
# clean up: close the connection, then delete the DB file
import os

conn.close()
os.unlink("test.db")

### 3.3 Multi-dimensionality in SQL

```
SELECT ContainingMB, AVG(NoPeople) 
FROM dwellings 
WHERE CensusYear = 2021
GROUP BY ContainingMB
```

* `AVG(NoPeople)` - a statistical dimension operation
* `CensusYear = 2021` - a temporal dimension operation
* `GROUP BY ContainingMB` - using spatial information, MB, but a statistical operation
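To answer the original question per suburb, the same style of query could be extended with a join, assuming a precomputed Mesh Block-to-suburb lookup table is available. The sketch below uses an in-memory SQLite DB; the `mb_suburb` table, its columns and all the values are hypothetical and not part of the lecture's CSV.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dwellings
    (ID INT PRIMARY KEY, NoPeople INT, ContainingMB INT, CensusYear INT);
    CREATE TABLE mb_suburb
    (MB INT PRIMARY KEY, Suburb TEXT);  -- hypothetical lookup table
    """
)
conn.executemany(
    "INSERT INTO dwellings VALUES (?,?,?,?);",
    [(1, 2, 101, 2016), (2, 4, 101, 2021), (3, 2, 102, 2021), (4, 6, 103, 2021)],
)
conn.executemany(
    "INSERT INTO mb_suburb VALUES (?,?);",
    [(101, "Suburb-1"), (102, "Suburb-1"), (103, "Suburb-2")],
)

# temporal filter, spatial roll-up (via the join) and statistics in one query
rows = conn.execute(
    """
    SELECT s.Suburb, AVG(d.NoPeople)
    FROM dwellings AS d
    JOIN mb_suburb AS s ON s.MB = d.ContainingMB
    WHERE d.CensusYear = 2021
    GROUP BY s.Suburb
    ORDER BY s.Suburb
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The join only works because the spatial relation (which Mesh Block is in which suburb) was calculated elsewhere and stored as plain values: the relational DB itself is not doing any spatial reasoning here.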


## 3. S-T typology

Let's look in detail at spatial and temporal dimensions...

### 3.1 Spatial

* spatial typology can be called topology
* _topology_ has a robust, formal set of models
    * [Dimensionally Extended 9-Intersection Model (DE-9IM)](https://en.wikipedia.org/wiki/DE-9IM)
    * [Simple Features](https://en.wikipedia.org/wiki/Simple_Features) [IS1, IS2]
* many tools implement functions for these models
    * PostGIS
    * Oracle
    * QGIS
    * Python (e.g. shapely)
    * GeoSPARQL implementations (next)

Topological function examples:

![](img/TopologicSpatialRelarions2.png)
**Figure 2**: Topological spatial relations examples
By Krauss - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=21299138

Example of [shapely](https://pypi.org/project/Shapely/) implementing _contains_:

In [None]:
from shapely.geometry import Polygon, Point

poly = Polygon([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)])
point = Point(0.5, 0.5)  # contained
poly.contains(point)

In [None]:
point2 = Point(1.5, 0.5)  # not contained
poly.contains(point2)
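To give a feel for what a _contains_ test involves, here is a minimal plain-Python sketch of one classic technique, ray casting. It is not how shapely works internally (shapely delegates to the GEOS library); the function name `point_in_polygon` is hypothetical, and points lying exactly on the boundary are not treated consistently by this simple version.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how many polygon edges a horizontal ray
    from (x, y) towards +infinity crosses; an odd count means inside."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# same unit square and points as in the shapely cells above
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(point_in_polygon(0.5, 0.5, square))
print(point_in_polygon(1.5, 0.5, square))
```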

* How are these functions used?
    * for queries against spatial data collections
        * search for things with geometries containing a point using the Loc-I _Explorer_: https://explorer.loci.cat/
    * pre-calculating relations, e.g. the Feature [Statistical Area 2_404011096](https://linked.data.gov.au/dataset/asgs2016/statisticalarealevel2/404011096) has a number of pre-calculated topological relations [given in machine-readable form](https://linked.data.gov.au/dataset/asgs2016/statisticalarealevel2/404011096?_view=loci&_format=text/turtle)
    
> _**ASIDE**: many domains need custom topological relations, such as hydrological catchments which have a special 'downstream' relation relevant for flood calculations_

### 3.2 Temporal

* temporal data is everywhere
    * it's less visible than spatial, because it's simpler - 1D v. 2D/3D - and handled invisibly
    * still important to consider deeply
* simpler topology: Allen relations [A83]

![](img/allen-relations.png)
**Figure 3**: Allen relations
_from https://www.ics.uci.edu/~alspaugh/cls/shr/allen.html_

* implemented in many date/time tools
    * e.g. Python's datetime library:

In [None]:
import datetime

a = datetime.datetime(2021, 9, 7)
b = datetime.datetime(2021, 10, 7)  # one month after a, above (arguments are year, month, day)

# before(a, b)
print(a < b)  

In [None]:
# after(a, b)
print(a > b)  
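The comparisons above work on instants. Allen relations are defined between intervals, so a minimal sketch needs a start and an end for each. The function below is an illustration only: its name `allen_relation` is hypothetical, it assumes proper intervals (start earlier than end) and it covers just a few of the thirteen relations.

```python
from datetime import datetime


def allen_relation(a_start, a_end, b_start, b_end):
    """Return the Allen relation of interval a relative to interval b,
    for a handful of the relations only."""
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    return "other"  # remaining relations and inverses are omitted in this sketch


jan, feb, mar, apr, jun, aug = (datetime(2021, m, 1) for m in (1, 2, 3, 4, 6, 8))

print(allen_relation(jan, mar, jun, aug))
print(allen_relation(jan, mar, mar, jun))
print(allen_relation(feb, mar, jan, jun))
print(allen_relation(jan, apr, mar, jun))
```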

## 4. Semantic modelling

### 4.1 RDF Recap

_some short recapping of how the Resource Description Framework (RDF) works_

For discussion:

![](img/example-graph-iris.jpg)  
**Figure 4**: RDF image, from [the RDF Primer](https://www.w3.org/TR/rdf11-primer/)

Note that:
* _everything_ is "strongly" identified
    * including all relationships
    * unlike lots of related data
* many of the identifiers resolve
    * to more info (on the web)

### 4.2 Semantic Modelling

We can use RDF to store both the _models_ we use and the _data_. This is different to relational and NoSQL DBs, whose models are DB schemas kept separate from the data.

An extension on the last lecture:

![](img/rdf-model-data.png)  
**Figure 5**: RDF representing both model and data information

With the model and the data in one place, we can use them together. Using the model and data above, a pseudo code query to find "Nick" could be:

> _find all the objects of type `ex:Person` with the `ex:name` property equal to "Nick"_

In [SPARQL](https://www.w3.org/TR/sparql11-query/) this would be:

```
PREFIX ex: <http://example.com/>

SELECT *
WHERE {
    ?p a ex:Person ;
        ex:name "Nick" ;
    .
}
```

To demo:

In [None]:
# show the data file content
print(open("lecture_02_person.ttl").read())

In [None]:
# use the rdflib library
from rdflib import Graph

# load the RDF file into a graph
g = Graph().parse("lecture_02_person.ttl")
print(f"The number of triples in the graph is {len(g)}")

# query the graph
for r in g.query(
    """
    PREFIX ex: <http://example.com/>
    
    SELECT *
    WHERE {
        ?p a ex:Person ;
            ex:name "Nick" ;
        .
    }
    """
):
    print(f"The ID of the node of type ex:Person with ex:name \"Nick\" is {r[0]}")


Extend the above query to _find all the objects of type `ex:Thing`_:

In [None]:
# query the graph
for r in g.query(
    """
    PREFIX ex: <http://example.com/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
    SELECT *
    WHERE {
        ?p rdf:type/rdfs:subClassOf ex:Thing .
    }
    """
):
    print(f"The ID of nodes of any type that is a sub class of ex:Thing: {r[0]}")

`rdf:type/rdfs:subClassOf` in the query above is a 'path' query: one of the graph model superpowers!
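What a sequence path does can be sketched without any RDF library by treating a graph as a set of (subject, predicate, object) tuples and following one predicate hop at a time. This is a toy model, not how rdflib evaluates SPARQL; the helper names `follow` and `has_path` and the data are hypothetical.

```python
# toy graph: a set of (subject, predicate, object) tuples
triples = {
    ("ex:nick", "rdf:type", "ex:Person"),
    ("ex:nick", "ex:name", "Nick"),
    ("ex:Person", "rdfs:subClassOf", "ex:Thing"),
    ("ex:rex", "rdf:type", "ex:Dog"),  # ex:Dog has no declared superclass
}


def follow(graph, starts, predicate):
    """Return every object reachable from any start node via one predicate hop."""
    return {o for (s, p, o) in graph if p == predicate and s in starts}


def has_path(graph, node, path, target):
    """True if target is reached from node by following each predicate in path, in order."""
    frontier = {node}
    for predicate in path:
        frontier = follow(graph, frontier, predicate)
    return target in frontier


subjects = {s for (s, _, _) in triples}
matches = sorted(
    s for s in subjects
    if has_path(triples, s, ["rdf:type", "rdfs:subClassOf"], "ex:Thing")
)
print(matches)
```

Only nodes whose type has `ex:Thing` as a direct superclass match: the path is exactly two hops long, so a node typed directly as `ex:Thing`, or one whose type is two subclass steps away, would be missed.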

### 4.3 Reasoning

According to its model's rules, which follow general set theory, any `x` that is of type `y` is also of the type of any superclass of `y`. So for our data we can infer this relation:

![](img/rdf-reasoning.png)  
**Figure 6**: RDF reasoning: using the model within the data

We can execute software that "builds" data according to model rules to simplify queries. Here I'll just run the rule in isolation:

In [None]:
g.update(
    """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
    INSERT {
        ?x rdf:type ?z .
    }
    WHERE {
        ?x rdf:type ?y .
        ?y rdfs:subClassOf ?z .
    }
    """
)

# show the updated data, look for orcid:8742-7730's types
print(g.serialize())

Now we can use the simpler query:

In [None]:
for r in g.query(
    """
    PREFIX ex: <http://example.com/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
    SELECT *
    WHERE {
        ?p rdf:type ex:Thing .
    }
    """
):
    print(f"The ID of nodes of type ex:Thing: {r[0]}")
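The subclass rule from this section can also be sketched in plain Python over a set of (subject, predicate, object) tuples. A SPARQL `INSERT ... WHERE` matches its pattern once against the graph as it stands, so a class hierarchy more than one level deep needs the rule applied repeatedly until nothing new appears. The function name `infer_types` and the class `ex:Lecturer` are hypothetical.

```python
def infer_types(graph):
    """Apply 'x type y, y subClassOf z => x type z' until no new triples appear."""
    graph = set(graph)
    while True:
        new = {
            (x, "rdf:type", z)
            for (x, p1, y) in graph if p1 == "rdf:type"
            for (y2, p2, z) in graph if p2 == "rdfs:subClassOf" and y2 == y
        } - graph
        if not new:
            return graph
        graph |= new


# toy data with a two-level class hierarchy
triples = {
    ("ex:nick", "rdf:type", "ex:Lecturer"),
    ("ex:Lecturer", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Thing"),
}

inferred = infer_types(triples)
for t in sorted(inferred - triples):
    print(t)
```

The first pass adds `ex:nick rdf:type ex:Person`; only the second pass can then add `ex:nick rdf:type ex:Thing`.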

## 5. Semantic S-T modelling

### 5.1 GeoSPARQL & OWL Time

We have well-known _ontologies_ for spatial and temporal domains:

* GeoSPARQL: [GSP] - spatial
* OWL Time: [TIME] - temporal

Using them and representing both models and data:

![](img/semantic-st-modelling.png)  
**Figure 7**: (left) GeoSPARQL `geo:Feature` and `geo:Geometry` classes and their relations & OWL Time `time:TemporalEntity` (no `Feature` equivalent); (center) A flood, `ex:flood-1` modelled with both spatial and temporal properties; (right) the inference that `ex:flood-1` is a `geo:Feature`, from its use of `geo:hasGeometry`

### 5.2 Typology

GeoSPARQL and OWL Time contain spatial and temporal relations, as per _Simple Features_ (and others, such as Egenhofer, "EH") and _Allen Relations_ respectively:

Ontology | Relation | Property
:--- | --- | ---
GeoSPARQL | SF within | `geo:sfWithin`
| SF contains | `geo:sfContains`
| EH covers | `geo:ehCovers`
OWL Time | before | `time:before`
| after | `time:after`
| in | `time:intervalIn`

Following on from Figure 7: given that any object with spatial relations is a `geo:Feature`, if `ex:flood-1 geo:ehCovers ex:school-grounds-x` then `ex:flood-1 rdf:type geo:Feature` .

Both ontologies contain other property rules, e.g. `time:before`/`time:after` inverse & `time:before` transitivity. Given this data:

```
ex:IceAge time:before ex:Renaissance .

ex:20thCentury time:after ex:Renaissance .
```

the inverse rule, applied to the second triple, gives:

```
ex:IceAge time:before ex:Renaissance .

ex:Renaissance time:before ex:20thCentury .
```

and the transitive rule, applied to those two triples, gives:

```
ex:IceAge time:before ex:20thCentury .
```
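The inverse and transitive rules can be sketched in plain Python. This is a toy illustration, not an OWL reasoner: the function name `close_before` is hypothetical and relations are plain strings standing in for `time:before` and `time:after`.

```python
def close_before(facts):
    """facts: a set of (a, relation, b) with relation 'before' or 'after'.
    Returns every derivable (a, b) pair meaning 'a is before b'."""
    # inverse rule: 'x after y' is rewritten as 'y before x'
    before = {(a, b) if rel == "before" else (b, a) for (a, rel, b) in facts}
    # transitive rule: 'a before b' and 'b before d' give 'a before d',
    # repeated until no new pairs appear
    while True:
        new = {(a, d) for (a, b) in before for (c, d) in before if b == c} - before
        if not new:
            return before
        before |= new


facts = {
    ("ex:IceAge", "before", "ex:Renaissance"),
    ("ex:20thCentury", "after", "ex:Renaissance"),
}

derived = close_before(facts)
for pair in sorted(derived):
    print(pair)
```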

### 5.3 Functions

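GeoSPARQL contains functions to calculate relations from data, e.g. _contains_, as per the shapely example in 3.1. The functions compare geometry literals, so the query first follows `geo:hasGeometry` and `geo:asWKT`. The Simple Features function `geof:sfContains()` is used like this, to find all the things that `ex:feature-x` contains, spatially:

```
PREFIX ex: <http://example.com/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?contained
WHERE {
    ex:feature-x geo:hasGeometry/geo:asWKT ?ga .
    ?contained geo:hasGeometry/geo:asWKT ?gb .

    FILTER geof:sfContains(?ga, ?gb)
}
```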

OWL Time doesn't contain equivalent functions... but I've made some [TMF]! Let's see...

Example data:

In [None]:
print(open("lecture_02_tf.ttl").read())

The above data is shown in Figure 8.

![](img/tf-rdf.png)  
**Figure 8**: A set of `time:TemporalEntity` instances with declared and inferrable/calculable relations (in orange)

In [None]:
from rdflib import Graph
from timefuncs import TFUN

g = Graph().parse("lecture_02_tf.ttl")
print(f"The number of triples in the graph is {len(g)}")

In [None]:
for r in g.query(
        """
        PREFIX tfun: <https://w3id.org/timefuncs/>
        
        SELECT ?x ?y
        WHERE {
            ?x a ?c1 .
            ?y a ?c2 .
        
            FILTER tfun:isBefore(?x, ?y)
        }
        """
):
    print(f"{r[0]} is before {r[1]}")

Note that this query uses both graph path following logic and numerical (date) calculations.

## 6. Using semantic S-T data

A "semantically enhanced" version of the dwellings CSV used in the SQLite example above:

In [None]:
print(open("lecture_02_dwellings.ttl").read())

Enhanced in that:

* element IDs have been universalised:
    * 50055290000 &rarr; `https://linked.data.gov.au/dataset/asgs2016/50055290000`
* simple data types turned into nodes with properties:
    * "2016" &rarr; `ex:year-2016`
* spatial & temporal topology has been introduced:

```
<https://linked.data.gov.au/dataset/asgs2016/50055290000>
    a geo:Feature ;
    geo:sfTouches <https://linked.data.gov.au/dataset/asgs2016/50049040000> ;
.
```

```
ex:year-2021 
    a time:TemporalEntity ;
    time:after ex:year-2016 ;
    time:inXSDgYear 2021 ;
.
```

## Conclusions

What have we learned?

## References

* **[A83]**: Allen, James F. "Maintaining knowledge about temporal intervals". Communications of the ACM 26(11) pp.832-843, Nov. 1983.
* **[DEM9]**: Clementini E., Di Felice P., van Oosterom P. (1993) "A small set of formal topological relationships suitable for end-user interaction". In: Abel D., Chin Ooi B. (eds) Advances in Spatial Databases. SSD 1993. Lecture Notes in Computer Science, vol 692. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56869-7_16
* **[GSP]**: Open Geospatial Consortium "OGC GeoSPARQL - A Geographic Query Language for RDF Data". Implementation standard (draft). https://opengeospatial.github.io/ogc-geosparql/geosparql11/spec.html
* **[IS1]**: International Organization for Standardization "ISO 19125-1:2004 Geographic information -- Simple feature access -- Part 1: Common architecture". International standard.
* **[IS2]**: International Organization for Standardization "ISO 19125-2:2004 Geographic information -- Simple feature access -- Part 2: SQL option". International standard.
* **[TIME]**: Cox, Simon & Little, Chris (eds.) "Time Ontology in OWL". W3C Candidate Recommendation 26 March 2020. https://www.w3.org/TR/owl-time/
* **[TMF]**: Car, N.J. 2021. "RDFlib OWL TIME Functions". Software library online. https://github.com/rdflib/timefuncs