## Guest Lecture COMP7230
# Spatio-temporal data manipulation in Knowledge Graphs
#### by Dr Nicholas Car

This Notebook is the resource used to deliver a guest lecture for the [Australian National University](https://www.anu.edu.au)'s course [COMP7230](https://programsandcourses.anu.edu.au/2020/course/COMP7230): *Introduction to Programming for Data Scientists*. It is the second guest lecture in that series.

Click here to run this lecture in your web browser:
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nicholascar/comp7230-training/HEAD?filepath=lecture_02.ipynb)

## 0. Lecture Outline
1. [Notes about this training material](#sec-1)
2. [Spatio-Temporal data use](#sec-2)
3. [Multi-dimensional data](#sec-3)
4. [S-T typology](#sec-4)
5. [Semantic modelling](#sec-5)
6. [Semantic S-T modelling](#sec-6)
7. [Using semantic S-T data](#sec-7)
8. [Conclusion / Suggested Actions](#sec-8)
9. [References](#sec-9)

The goal of this lecture is to convey some sense of the power of Semantic Web data for data science.

<a id="sec-1"></a>
## 1. Notes about this training material

* this lecture is delivered using a [Jupyter Notebook](https://jupyter.org/)
* see [Lecture 1's Notebook](lecture_01.ipynb) for notes on how to use this material in detail

<a id="sec-2"></a>
## 2. Spatio-Temporal data use

> _"everything is somewhere..."_

everything is _some-when_ also?

### 2.1 Spatial
* lots of familiar tools: some free
    * Google Maps / OpenStreetMap for simple web display
    * PANDAS + maps for data sci
    * ArcGIS / QGIS for desktop spatial data manipulation
    * Oracle spatial or Postgres + PostGIS for spatial DB work

* lots of data formats
    * geometry serialisations (GeoJSON, WKT etc)
    * spatial data files (Shapefiles, File GeoDatabases, CSV!)
    * databases (Oracle, Postgres, even NoSQL, like Mongo + GeoJSON)

### 2.2 Temporal
Most software and database systems deal with temporality, at least to some extent.

### 2.3 S-T+
* spatial/temporal data is almost always linked to non-spatial data for information
* typical workflows involve separate spatial and non-spatial operations
    * perhaps subsetting a dataset by area (spatial)
    * perhaps then by time (temporal)
    * then aggregating/transforming resultset into a final form (non-S-T)

### 2.4 Example scenario: in QGIS

Q: _find the average number of people, per dwelling, per suburb, in August 2021_

Data might be presented as _people per dwelling per 'Mesh Block' for 2011 - 2021_

Workflow might be:

1. Filter out non-August 2021 data (temporal)
2. Average people per dwelling, per MeshBlock (non-spatial)
3. Intersect Mesh Blocks with suburbs ("LGAs") (spatial)

These steps are necessarily sequential if we can't represent all the dimensions of our data in one system. Even if we can, we need specialised functions for some dimensions: statistical (_average_), spatial (_within_, intersections) and temporal (also _within_).
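The sequential workflow can be sketched in plain Python. This is a minimal illustration only: the rows are in the shape of the scenario data, while the `MB_TO_SUBURB` lookup and the suburb names are invented, standing in for a real spatial intersection. Here dwellings are assigned to suburbs before averaging, so the result is a per-dwelling average rather than an average of Mesh Block averages.

```python
from collections import defaultdict
from statistics import mean

# (people in dwelling, Mesh Block, census year): a few rows in the shape of the scenario data
observations = [
    (3, "50055290000", 2021),
    (4, "50055290000", 2021),
    (1, "50055290000", 2016),
    (2, "50049040000", 2021),
    (7, "50049040000", 2021),
    (6, "50049040000", 2016),
]

# ASSUMPTION: a made-up lookup standing in for a real spatial intersection
MB_TO_SUBURB = {"50055290000": "Suburb A", "50049040000": "Suburb B"}


def average_per_suburb(rows, year):
    # 1. temporal: keep only the requested year
    in_year = [r for r in rows if r[2] == year]
    # 2. spatial: assign each dwelling's Mesh Block to a suburb
    by_suburb = defaultdict(list)
    for people, mb, _ in in_year:
        by_suburb[MB_TO_SUBURB[mb]].append(people)
    # 3. statistical: average people per dwelling within each suburb
    return {suburb: mean(counts) for suburb, counts in by_suburb.items()}


result = average_per_suburb(observations, 2021)
print(result)
```

Each step touches a different dimension, which is why a tool that handles only one of them forces an export to another system part-way through.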

How would this scenario work with QGIS?

![](lecture_resources/img/qgis-screenshot.png)
**Figure 1**: Screenshot of the QGIS tool showing ACT

In QGIS, we have lots of pre-loaded spatial operations, and also basic statistical operations such as 'average', but custom non-spatial operations aren't so easily catered for. 

QGIS and similar systems also have scripting abilities:

![](lecture_resources/img/qgis-scripting.png)
**Figure 2**: Screenshot of the QGIS tool showing Python scripting. After <https://www.qgistutorials.com/en/docs/getting_started_with_pyqgis.html>

```
for f in layer.getFeatures():
  geom = f.geometry()
  print(f"{f['name']}, {f['iata_code']}, {geom.asPoint().y()}, {geom.asPoint().x()}")
```

However, typically, users export data to PANDAS, Tableau etc for detailed processing.

<a id="sec-3"></a>
## 3. Multi-dimensional data

* This demo data is about Mesh Blocks, which are census counting areas within the Australian Bureau of Statistics' _Australian Statistical Geography Standard_, which is online in Linked Data form at <https://asgs.linked.fsdf.org.au>

### 3.1 Theory

Consider this data:

```
Table: 2011
-----------
ID Ppl MB
01 3   50055290000
02 5   50055290000
03 1   50055290000
...

Table: 2016
-----------
ID Ppl MB
11 3   50055290000
12 3   50055290000
...
```

We have _dimensions_ in columns.

![](lecture_resources/img/data-dimensions.png)  
**Figure 3**: Dimensions of the data in the _Example Scenario_ above

> **_What are dimensions?_**
>
> **Business**: "a set of data attributes pertaining to something of interest to a business"
>
> **Scientific/modelling**: "orthogonal projections of value"
>
> Datasets are often realised in "hypercubes" with multiple dimensions

* To perform any operations on these dimensions, S-T or other, we need to know about how those dimensions "work": their typology
* If we know this, we can characterise functions per dimension
* The more dimensions we can cater for in one place, the fewer sequential steps or systems people will need to use to get the results they want

### 3.2 Example scenario: in a relational DB

* Relational Databases easily cater for multi-dimensional data
* SQLite is a relational DB in a single file
    * likely the "most implemented" DB in the world: every Android and Apple phone
    * don't be misled by the single file: it can scale to billions of entries...
* Python contains code to work with the SQLite DB "out of the box"

> NOTE: relational databases are awesome! My opinion is that most data science students would benefit from better relational DB training. RDBs are not particularly hard to use, are widely used and can do lots of things, but sometimes seem forgotten in the rush to demonstrate use of the latest Python data science package...

Data for our scenario in RDB-ready form:

In [1]:
print(open("lecture_resources/lecture_02_dwellings.csv").read())

ID,NoPeople,ContainingMB,CensusYear
01,3,50055290000,2011
02,5,50055290000,2011
03,1,50055290000,2011
04,1,50055290000,2011
05,3,50049040000,2011
06,6,50049040000,2011
07,3,50049040000,2011
08,1,50049040000,2011
09,5,50049040000,2011
10,3,50049040000,2011
11,3,50055290000,2016
12,3,50055290000,2016
13,3,50055290000,2016
14,1,50055290000,2016
15,2,50049040000,2016
16,6,50049040000,2016
17,3,50049040000,2016
18,1,50049040000,2016
19,4,50049040000,2016
20,3,50049040000,2016
21,3,50055290000,2021
22,4,50055290000,2021
23,1,50055290000,2021
24,2,50055290000,2021
25,2,50049040000,2021
26,7,50049040000,2021
27,3,50049040000,2021
28,1,50049040000,2021
29,5,50049040000,2021
30,3,50049040000,2021


This data is in simple Comma-Separated Value (CSV) form. There are many ways to load it into SQLite, but I'll take a fairly simple approach below: create a table that matches the column content of the data, then insert the data into the table by reading the file line-by-line.

In [2]:
# use standard Python libraries
import csv
import sqlite3

# create an SQLite DB
conn = sqlite3.connect('test.db')
print("Opened database successfully")

# create a table within it
conn.execute("DROP TABLE IF EXISTS dwellings;")  # good practice to remove preexisting
conn.execute(
    """
    CREATE TABLE dwellings
    (ID INT PRIMARY KEY    NOT NULL,
    NoPeople        INT    NOT NULL,
    ContainingMB    INT   NOT NULL,
    CensusYear      INT    NOT NULL);
    """
)
print("Table created successfully")

# check it's empty
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM dwellings;")
rows = cur.fetchall()
print(f"Rows in table: {rows[0][0]}")  # should be 0

Opened database successfully
Table created successfully
Rows in table: 0


In [3]:
# read CSV data, insert it into table
with open("lecture_resources/lecture_02_dwellings.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for field in reader:
        conn.execute("INSERT INTO dwellings VALUES (?,?,?,?);", field)
print("Read CSV data into table")

# check there are 30 entries in table
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM dwellings;")
rows = cur.fetchall()
print(f"Rows in table: {rows[0][0]}")  # should be 30

Read CSV data into table
Rows in table: 30


With that data loaded, let's naively query for the average number of people per MB:

In [4]:
# select only 2021 data, aggregate by containing Mesh Block
cur = conn.cursor()
cur.execute(
    """
    SELECT ContainingMB, AVG(NoPeople) 
    FROM dwellings 
    WHERE CensusYear = 2021
    GROUP BY ContainingMB
    """
)
rows = cur.fetchall()
print("Average number of people per dwelling per Mesh Block in 2021:")
print()
for row in rows:
    print(f"{row[0]}, {row[1]}")

Average number of people per dwelling per Mesh Block in 2021:

50049040000, 3.5
50055290000, 2.5


### 3.3 Multi-dimensionality in SQL

The query used above was:

```
SELECT ContainingMB, AVG(NoPeople) 
FROM dwellings 
WHERE CensusYear = 2021
GROUP BY ContainingMB
```

Its dimensions:

* `AVG(NoPeople)` - a statistical dimension operation
* `CensusYear = 2021` - a temporal dimension operation
* `GROUP BY ContainingMB` - using spatial information, MB, but a statistical operation

We are able to pose this query because the dimensions here and their scales are simple!

If we had different data, perhaps with more complex spatial information, we might need detailed spatial knowledge to use it effectively. E.g.:

```
SELECT AVG(NoPeople) 
FROM dwellings 
WHERE CensusYear = 2021
AND ContainingMB NEXT TO MB1234
GROUP BY ContainingMB
```

`NEXT TO ...` is not real SQL syntax.
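In plain SQL, a relation like `NEXT TO` has to be supplied as data. A minimal sketch, assuming a hypothetical `mb_touches` adjacency table (not part of the lecture's data files) and a handful of rows in the shape of the `dwellings` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwellings (ID INT, NoPeople INT, ContainingMB INT, CensusYear INT)")
# ASSUMPTION: hypothetical table of pre-calculated adjacency, one row per direction
conn.execute("CREATE TABLE mb_touches (MB INT, TouchesMB INT)")

conn.executemany(
    "INSERT INTO dwellings VALUES (?,?,?,?)",
    [
        (21, 3, 50055290000, 2021),
        (22, 4, 50055290000, 2021),
        (25, 2, 50049040000, 2021),
        (26, 7, 50049040000, 2021),
        (15, 2, 50049040000, 2016),
    ],
)
conn.executemany(
    "INSERT INTO mb_touches VALUES (?,?)",
    [(50055290000, 50049040000), (50049040000, 50055290000)],
)

# "dwellings in Mesh Blocks next to 50055290000", expressed as a join
rows = conn.execute(
    """
    SELECT d.ContainingMB, AVG(d.NoPeople)
    FROM dwellings AS d
    JOIN mb_touches AS t ON t.MB = d.ContainingMB
    WHERE d.CensusYear = 2021
    AND t.TouchesMB = 50055290000
    GROUP BY d.ContainingMB
    """
).fetchall()
print(rows)
conn.close()
```

The join works, but the database knows nothing about what "touches" means: someone with spatial knowledge has to calculate and maintain the adjacency rows.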

### 3.4 SQL and Data Frames

Relational DBs can store lots of information and, via SQL queries, give you just the subsets of it that you want to report on or graph.

A lot of data science is now done using Python tooling that manipulates data in-memory as this allows data scientists to run scripts, not servers. [PANDAS](https://pandas.pydata.org/about/) is such a tool.

Exporting SQLite data to a PANDAS _DataFrame_ using Python is incredibly easy since PANDAS has SQL querying built in.

Let's subset the data by getting just 2021 results using SQL and put it in a data frame:

In [5]:
# using our DB established above,
# query the DB, load results into a dataframe
import pandas as pd
df = pd.read_sql("SELECT * FROM dwellings WHERE CensusYear = 2021", conn)

print(df.to_string())

   ID  NoPeople  ContainingMB  CensusYear
0  21         3   50055290000        2021
1  22         4   50055290000        2021
2  23         1   50055290000        2021
3  24         2   50055290000        2021
4  25         2   50049040000        2021
5  26         7   50049040000        2021
6  27         3   50049040000        2021
7  28         1   50049040000        2021
8  29         5   50049040000        2021
9  30         3   50049040000        2021


We can subset the DB any way we like, create a DataFrame, do lots of calculations with it as well as plotting etc., as we did in Lecture 1.

Let's find the average number of people per dwelling, per Mesh Block using PANDAS:

In [6]:
df.groupby("ContainingMB")["NoPeople"].mean()

ContainingMB
50049040000    3.5
50055290000    2.5
Name: NoPeople, dtype: float64

In [7]:
# remove our DB to clean up
conn.close()
import os
os.unlink("test.db")

<a id="sec-4"></a>
## 4. S-T typology

Lucky for us, the data to date has been pre-aligned per spatial & temporal feature (MB & year)

Let's look in detail at spatial and temporal dimensions...

### 4.1 Spatial

* spatial typology can be called topology
* _topology_ has a robust, formal set of models
    * [Dimensionally Extended 9-Intersection Model (DE-9IM)](https://en.wikipedia.org/wiki/DE-9IM)
    * [Simple Features](https://en.wikipedia.org/wiki/Simple_Features) [IS1, IS2]
* many tools implement functions for these models
    * PostGIS
    * Oracle
    * QGIS
    * Python (e.g. the Shapely library)
    * GeoSPARQL implementations (next)

Topological function examples:

![](lecture_resources/img/TopologicSpatialRelarions2.png)
**Figure 4**: Topological spatial relations examples
By Krauss - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=21299138

Example of [Shapely](https://pypi.org/project/Shapely/) implementing _contains_:

In [8]:
from shapely.geometry import Polygon, Point

poly = Polygon([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)])
point = Point(0.5, 0.5)  # contained
poly.contains(point)

True

In [9]:
point2 = Point(1.5, 0.5)  # not contained
poly.contains(point2)

False
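What a _contains_ test has to work out can be sketched with the classic ray casting algorithm. This is an illustration only, not how Shapely does it (Shapely delegates to the GEOS library), and the `contains` function below is a hypothetical helper that makes no promises about points lying exactly on the boundary:

```python
def contains(polygon, point):
    """Ray casting: count how many polygon edges a ray from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # only count crossings to the right of the point
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(contains(square, (0.5, 0.5)))
print(contains(square, (1.5, 0.5)))
```

An odd number of crossings means the point is inside, an even number means it is outside, which matches the two Shapely results above.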

* How are these functions used?
    * for queries against spatial data collections
        * try geometry-based searching on the Indigenous Data Network's Spatial Data Catalogue: <https://data.idnau.org/s>
    * pre-calculating relations, e.g. the Feature [Statistical Area 2 404011096](https://linked.data.gov.au/dataset/asgsed3/SA2/404011096) has a number of pre-calculated topological relations [given in machine-readable form](http://asgs.linked.fsdf.org.au/dataset/asgsed3/collections/SA2/items/404011096?_profile=geo&_mediatype=text/turtle)
    
> _**ASIDE**: many domains need custom topological relations, such as hydrological catchments which have a special 'downstream' relation relevant for flood calculations_

### 4.2 Temporal

* temporal data is everywhere
    * it's less visible than spatial, because it's simpler - 1D v. 2D/3D - and somewhat handled invisibly by lots of software
    * still important to consider deeply
* simpler topology: Allen relations [A83]

![](lecture_resources/img/allen-relations.png)
**Figure 5**: Allen relations
_from https://www.ics.uci.edu/~alspaugh/cls/shr/allen.html_

* implemented in many date/time tools
    * e.g. Python's datetime library:

In [10]:
import datetime

a = datetime.datetime(2021, 9, 7)
b = datetime.datetime(2021, 10, 7)  # one month after a, above

# before(a, b)
print(a < b)  

True


In [11]:
# after(a, b)
print(a > b)  

False


Python's comparison operators (`<`, `>`) are interpreted as before/after for 1D time.
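The comparisons above work on instants. Allen relations are defined between intervals, and a few of them can be sketched with pairs of dates. The `allen_relation` function and the interval names are hypothetical, and only five of the thirteen relations are covered:

```python
from datetime import date


def allen_relation(a, b):
    """Return one of a few Allen relations between intervals a and b, each a (start, end) pair."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    return "other"  # the remaining relations and all inverses are not covered here


year_2016 = (date(2016, 1, 1), date(2016, 12, 31))
year_2021 = (date(2021, 1, 1), date(2021, 12, 31))
august_2021 = (date(2021, 8, 1), date(2021, 8, 31))

print(allen_relation(year_2016, year_2021))
print(allen_relation(august_2021, year_2021))
```

Every relation reduces to comparisons of start and end points, which is why general-purpose date/time libraries can support them without special spatial-style tooling.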

<a id="sec-5"></a>
## 5. Semantic modelling

### 5.1 RDF Recap

_A short recapping of how the Resource Description Framework (RDF) works_

For discussion:

![](lecture_resources/img/example-graph-iris.jpg)  
**Figure 6**: RDF image, from [the RDF Primer](https://www.w3.org/TR/rdf11-primer/)

Note that:
* _everything_ is "strongly" identified
    * including all relationships
    * unlike lots of related data
* many of the identifiers resolve
    * to more info (on the web)

### 5.2 Semantic Modelling

We can use RDF to store both the _models_ we use and the _data_. This is different to relational and NoSQL DBs, whose models are DB schema and separate from the data.

An extension on the last lecture:

![](lecture_resources/img/rdf-model-data.png)  
**Figure 7**: RDF representing both model and data information

With the model and the data in one place, we can use them together. Using the model and data above, a pseudo code query to find "Nick" could be:

> _Find all the objects of type `ex:Person` with the `ex:name` property equal to "Nick"_

In [SPARQL](https://www.w3.org/TR/sparql11-query/) this would be:

```
PREFIX ex: <http://example.com/>

SELECT *
WHERE {
    ?p a ex:Person ;
        ex:name "Nick" ;
    .
}
```

To demo:

In [12]:
# show the data file content
print(open("lecture_resources/lecture_02_person.ttl").read())

PREFIX ex: <http://example.com/>
PREFIX orcid: <https://orcid.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

#
# model
#
ex:Thing
    a rdfs:Class ;
.

ex:Person
    a rdfs:Class ;
    rdfs:subClassOf ex:Thing ;
.

ex:name
    a rdf:Property ;
    rdfs:domain ex:Person ;
    rdfs:range ex:string ;
.

ex:string
    a rdfs:Datatype ;
.

#
# data
#
orcid:8742-7730
    a ex:Person ;
    ex:name "Nick" ;
.



In [13]:
# use the rdflib library
from rdflib import Graph

# load the RDF file into a graph
g = Graph().parse("lecture_resources/lecture_02_person.ttl")
print(f"The number of triples in the graph is {len(g)}")  # should be 9

The number of triples in the graph is 9


In [14]:
# query the graph
print(f"The ID of the node of type ex:Person with ex:name \"Nick\" is:")
print()
for r in g.query(
    """
    PREFIX ex: <http://example.com/>
    
    SELECT *
    WHERE {
        ?p a ex:Person ;
            ex:name "Nick" ;
        .
    }
    """
):
    print(f"* {r[0]}")

The ID of the node of type ex:Person with ex:name "Nick" is:

* https://orcid.org/8742-7730


Extend the above query to find all the objects of type `ex:Thing`:

In [15]:
# query the graph
print("The ID of nodes of any type that is a sub class of ex:Thing:")
print()
for r in g.query(
    """
    PREFIX ex: <http://example.com/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
    SELECT *
    WHERE {
        ?p rdf:type/rdfs:subClassOf ex:Thing .
    }
    """
):
    print(f"* {r[0]}")

The ID of nodes of any type that is a sub class of ex:Thing:

* https://orcid.org/8742-7730


> `rdf:type/rdfs:subClassOf` in the query above is a 'path' query: one of the graph model superpowers!

We have done this little demo to indicate how Knowledge Graph queries can navigate over edges.

### 5.3 Reasoning

According to its model's rules, which follow general set theory, any `x` that is of type `y` is also of the type of any superclass of `y`. So for our data we can infer this relation:

![](lecture_resources/img/rdf-reasoning.png)  
**Figure 8**: RDF reasoning: using the model within the data

We can execute software that "builds" data according to model rules to simplify queries. 

Building new data, in RDF Knowledge Graphs, often comes down to adding new nodes or edges to the graph.

Let's add the edge for the above:

In [16]:
g.update(
    """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
    INSERT {
        ?x rdf:type ?z .
    }
    WHERE {
        ?x rdf:type ?y .
        ?y rdfs:subClassOf ?z .
    }
    """
)

# show the updated data, look for orcid:8742-7730's types
print(g.serialize())

@prefix ex: <http://example.com/> .
@prefix orcid: <https://orcid.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Person a rdfs:Class ;
    rdfs:subClassOf ex:Thing .

ex:Thing a rdfs:Class .

ex:name a rdf:Property ;
    rdfs:domain ex:Person ;
    rdfs:range ex:string .

orcid:8742-7730 a ex:Person,
        ex:Thing ;
    ex:name "Nick" .

ex:string a rdfs:Datatype .




We can see above that `orcid:8742-7730` is now of type `ex:Thing` as well as the original `ex:Person`.

Now we can use the simpler query:

In [17]:
print("The ID of nodes of any type that is a sub class of ex:Thing: ")
print()
for r in g.query(
    """
    PREFIX ex: <http://example.com/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
    SELECT *
    WHERE {
        ?p rdf:type ex:Thing .
    }
    """
):
    print(f"* {r[0]}")

The ID of nodes of any type that is a sub class of ex:Thing: 

* https://orcid.org/8742-7730
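The same inference can be sketched without SPARQL, on plain Python tuples, by repeating the rule until it adds nothing new so that deeper class hierarchies are also handled. This is an illustration under assumptions: the `infer_types` function is hypothetical and prefixed names are kept as simple strings rather than real IRIs.

```python
# a few triples from the person example, as plain (subject, predicate, object) tuples
triples = {
    ("ex:Person", "rdfs:subClassOf", "ex:Thing"),
    ("orcid:8742-7730", "rdf:type", "ex:Person"),
    ("orcid:8742-7730", "ex:name", "Nick"),
}


def infer_types(graph):
    """Apply 'x type y, y subClassOf z => x type z' until nothing new is added."""
    graph = set(graph)
    while True:
        new = {
            (x, "rdf:type", z)
            for (x, p1, y) in graph if p1 == "rdf:type"
            for (y2, p2, z) in graph if p2 == "rdfs:subClassOf" and y2 == y
        } - graph
        if not new:
            return graph
        graph |= new


inferred = infer_types(triples)
print(sorted(inferred - triples))
```

As with the SPARQL update, reasoning here comes down to adding edges to the graph; the original triples are left untouched.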


<a id="sec-6"></a>
## 6. Semantic S-T modelling

_In this section we are going to step through common Semantic Web ways of modelling spatiality and temporality, based on the typologies described above. Just note that there are many ways of doing things in the Semantic Web and while these methods are common, there are other ways for other purposes._

### 6.1 GeoSPARQL & OWL Time

We have well-known, standardised and free, _ontologies_ for spatial and temporal domains:

* GeoSPARQL: [GSP] - spatial
* OWL Time: [TIME] - temporal

Using them and representing both models and data:

![](lecture_resources/img/semantic-st-modelling.png)  
**Figure 9**: (left) GeoSPARQL `geo:Feature` and `geo:Geometry` classes and their relations & OWL Time `time:TemporalEntity` (no `Feature` equivalent); (center) A flood, `ex:flood-1` modelled with both spatial and temporal properties; (right) the inference that `ex:flood-1` is a `geo:Feature`, from its use of `geo:hasGeometry`

### 6.2 Typology

GeoSPARQL & OWL Time contain spatial and temporal relations as per _Simple Features_ and _Allen Relations_ respectively:

Ontology | Relation | Property
:--- | --- | ---
GeoSPARQL | SF within | `geo:sfWithin`
&nbsp; | SF contains | `geo:sfContains`
&nbsp; | EH covers | `geo:ehCovers`
OWL Time | before | `time:before`
&nbsp; | after | `time:after`
&nbsp; | in | `time:intervalIn`

Following on from Figure 9: given that any object with spatial relations is a `geo:Feature`, if `ex:flood-1 geo:ehCovers ex:school-grounds-x` then `ex:flood-1 rdf:type geo:Feature` .

Both ontologies contain other property rules, e.g. `time:before` transitivity & `time:before`/`time:after` inverse. Let's perform some simple reasoning using these two OWL Time rules:

Given:
```
ex:IronAge time:before ex:Renaissance .

ex:20thCentury time:after ex:Renaissance .
```
Using `time:before`/`time:after` inverse:
```
ex:IronAge time:before ex:Renaissance .

ex:Renaissance time:before ex:20thCentury .
```
Using `time:before` transitivity:
```
ex:IronAge time:before ex:20thCentury .
```
So a Semantic database loaded with OWL Time's rules would automatically calculate the last statement from the given statements and thus you could query the database like this and get a result:

```
PREFIX ex: <http://example.com/>
PREFIX time: <http://www.w3.org/2006/time#>

SELECT * 
WHERE {
    ex:IronAge time:before ?x 
}
```

...and you would get: `?x` = `ex:Renaissance`, `ex:20thCentury`. The results from given and calculated data are not normally differentiated, although many systems allow you to indicate given/calculated if you want.

Let's run that example in code.

In [18]:
# parse given data into a graph
from rdflib import Graph

g = Graph().parse(
    data="""
    PREFIX ex: <http://example.com/>
    PREFIX time: <http://www.w3.org/2006/time#>

    ex:IronAge time:before ex:Renaissance .

    ex:20thCentury time:after ex:Renaissance .
    """,
    format="turtle"
)
# print no. triples in the graph
print(len(g))  # should be 2

2


In [19]:
# manually run each OWL Time rule
# time:before/time:after inverse
g.update(
    """
    PREFIX time: <http://www.w3.org/2006/time#>

    INSERT {
        ?y time:after ?x .
        ?n time:before ?m .
    }
    WHERE {
        ?x time:before ?y .
        ?m time:after ?n .
    }
    """
)
# print no. triples in the graph
print(len(g))  # should be 4

4


In [20]:
# time:before transitivity
g.update(
    """
    PREFIX time: <http://www.w3.org/2006/time#>

    INSERT {
        ?x time:before ?y
    }
    WHERE {
        ?x time:before+ ?y .
    }
    """
)
# print no. triples in the graph
print(len(g))  # should be 5

5


In [21]:
# print total graph
print(g.serialize())

@prefix ex: <http://example.com/> .
@prefix time: <http://www.w3.org/2006/time#> .

ex:IronAge time:before ex:20thCentury,
        ex:Renaissance .

ex:20thCentury time:after ex:Renaissance .

ex:Renaissance time:after ex:IronAge ;
    time:before ex:20thCentury .




In [22]:
# ask the query
for r in g.query(
        """
        PREFIX ex: <http://example.com/>
        PREFIX time: <http://www.w3.org/2006/time#>

        SELECT *
        WHERE {
            ex:IronAge time:before ?x
        }
        """
    ):
    print(f"<{r[0]}>")  # should print 2 results

<http://example.com/Renaissance>
<http://example.com/20thCentury>


### 6.3 Functions

GeoSPARQL contains functions to calculate relations from geometry data, e.g. `geof:sfContains`, much as Shapely's `contains` did in the example in 4.1. The function `geof:sfContains()` takes geometry literals and is used like this, to find all the things that `ex:feature-x` contains, spatially:

```
PREFIX ex: <http://example.com/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?contained
WHERE {
    ex:feature-x geo:hasGeometry/geo:asWKT ?ga .
    ?contained geo:hasGeometry/geo:asWKT ?gb .

    FILTER geof:sfContains(?ga, ?gb)
}
```

OWL Time doesn't contain equivalent functions... but I've made some [TMF]! Let's see...

Example data:

In [None]:
print(open("lecture_resources/lecture_02_tf.ttl").read())

The above data is shown in Figure 10.

![](lecture_resources/img/tf-rdf.png)  
**Figure 10**: A set of `time:TemporalEntity` instances with declared and inferrable/calculable relations (in orange)

In [None]:
from rdflib import Graph
from timefuncs import TFUN

g = Graph().parse("lecture_resources/lecture_02_tf.ttl")
print(f"The number of triples in the graph is {len(g)}")  # should be 9

In [None]:
for r in g.query(
        """
        PREFIX tfun: <https://w3id.org/timefuncs/>
        
        SELECT ?x ?y
        WHERE {
            ?x a ?c1 .
            ?y a ?c2 .
        
            FILTER tfun:isBefore(?x, ?y)
        }
        """
):
    print(f"{r[0]} is before {r[1]}")  # should be 4

Note that this query uses both graph path following logic and numerical (date) calculations.

<a id="sec-7"></a>
## 7. Using semantic S-T data

### 7.1 Semantically Enhanced Dwelling data

A "semantically enhanced" version of the dwellings CSV used in the SQLite example above:

In [None]:
print(open("lecture_resources/lecture_02_dwellings.ttl").read())

Enhanced in that:

* element IDs have been universalised with the addition of globally unique, Internet, namespaces:
    * 50055290000 &rarr; `https://linked.data.gov.au/dataset/asgsed3/MB/50055290000`
    * using a reference to the namespace, `mb` == `https://linked.data.gov.au/dataset/asgsed3/MB/`:
        * 50055290000 &rarr; `mb:50055290000`

> _This namespace and the item are real and the link resolves! Try: https://linked.data.gov.au/dataset/asgsed3/MB/50055290000_

* simple data types turned into RDF nodes with their own properties:
    * "2016" &rarr; `ex:year-2016`
* spatial & temporal topology has been introduced:
    * ```
    mb:50055290000
        a geo:Feature ;
        geo:sfTouches mb:50049040000 ;
    .
    ```

    * ```
    ex:year-2021 
        a time:TemporalEntity ;
        time:after ex:year-2016 ;
        time:inXSDgYear 2021 ;
    .
    ```
* data relationships (the model used) are given, see next

The header and first data row from the CSV version of the data looked like this:

```
ID,NoPeople,ContainingMB,CensusYear
01,3,50055290000,2011
```

The Semantic version's first row equivalent looks like this:

```
ex:obs-01
    a ex:Observation ;
    ex:containingMB mb:50055290000 ;
    ex:hasNoPeople 3 ;
    time:hasTime ex:year-2011 .
```

The informal linking of data in columns, in the CSV "3" to "NoPeople", via the column heading, has been replaced with a formal, defined, relationship which here is `ex:hasNoPeople 3`. The `ex:hasNoPeople` property is defined now (not left to a human's interpretation of the text "NoPeople") and added to the RDF dwellings file like this:

```
ex:hasNoPeople
    a rdf:Property ;
    rdfs:label "has number of people" ;
    rdfs:domain ex:Observation ;
    rdfs:range xsd:integer ;
.
```

So `ex:hasNoPeople` 'means' a property which is labelled "has number of people" and which must (only) be applied to records of type `ex:Observation` - a so-called 'domain' constraint - and which can only result in an `xsd:integer` - the standard integer data type, a 'range' constraint. What _has number of people_ actually means can be explained in further descriptions and even relations to other similar properties, perhaps general counts of things, if desired.
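The row-by-row translation from CSV to this RDF shape can be sketched with simple string templates. How the lecture's Turtle file was actually produced is not stated, so treat this as one possible approach; the `row_to_turtle` function is hypothetical and the prefix declarations are omitted.

```python
import csv
from io import StringIO

# the header and first two rows of the dwellings CSV
csv_text = """ID,NoPeople,ContainingMB,CensusYear
01,3,50055290000,2011
02,5,50055290000,2011
"""


def row_to_turtle(row):
    """Render one CSV row as a Turtle description of an ex:Observation."""
    return (
        f"ex:obs-{row['ID']}\n"
        f"    a ex:Observation ;\n"
        f"    ex:containingMB mb:{row['ContainingMB']} ;\n"
        f"    ex:hasNoPeople {int(row['NoPeople'])} ;\n"  # int() enforces the integer range
        f"    time:hasTime ex:year-{row['CensusYear']} .\n"
    )


turtle = "\n".join(row_to_turtle(r) for r in csv.DictReader(StringIO(csv_text)))
print(turtle)
```

Each column heading becomes a defined property and each Mesh Block and year value becomes a reference to a node, rather than a bare number.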

Our Semantic dwellings can be visually presented as per Fig. 11 below:

![](lecture_resources/img/sem-dwellings-02.png)
**Figure 11**: The dimensions of the semantically enhanced version of the dwellings data. The property indicating the number of people counted per observation, per dwelling, is a statistical property. Spatial and Temporal dimensions are dealt with via complex objects, an _Observed feature_ and an _Observed time period_ respectively, related to the Observation. The specific values of _Mesh Block_ and _Year_ for the _Observed feature_ and _Observed time period_ are not all we have: we also have relations to other features and time periods, e.g. Mesh Block 50055290000 touching Mesh Block 50049040000, which is known via other properties of the Mesh Blocks themselves.

### 7.2 Extending the Dwelling data

We can easily extend semantic data both by adding more things in new dimensions of established types - perhaps incorporating information about the methods used to obtain observations from a surveying ontology - and by defining our own properties and objects if no established models exist that meet our needs. Fig. 12 shows potential new properties (simple data) or relations (complex objects) added at two different points to our data: the Observation element and the Observed Feature.

![](lecture_resources/img/sem-dwellings-03.png)
**Figure 12**: New information - both model and data elements - added at various points to the data in Fig. 11.

> _ASIDE: Remember that with Semantic Web data, we are often able to directly reuse existing datasets. In this example, there is indeed a wealth of Semantic Web information about Mesh Blocks in the [ASGS Ontology](https://linked.data.gov.au/def/asgs) and [ASGS 2016](https://linked.data.gov.au/dataset/asgs2016) datasets that we could extend our dwellings data with. The [Loc-I project](https://www.ga.gov.au/locationindex) provides a number of Australian national datasets in Semantic web form, including multiple versions of the ASGS._

A dimension of data that is often of great relevance is that of provenance: the how and when of data's production. There is a well-known provenance ontology that we could use to provide us with properties and objects to describe how our dwellings data was collected and processed. Fig. 13 shows our data with some potential provenance extensions.

![](lecture_resources/img/sem-dwellings-04.png)
**Figure 13**: Provenance dimension information for the Observation and Observed Feature of our dwellings data indicated

### 7.3 Semantic multi-dimensional data querying

With our non-semantic data loaded into the SQLite relational database, we formulated the following SQL query that utilised multiple dimensions:
```
SELECT ContainingMB, AVG(NoPeople)
FROM dwellings
WHERE CensusYear = 2021
GROUP BY ContainingMB
```

With our semantic data, we can emulate this query in SPARQL like this:

In [23]:
from rdflib import Graph

# load the data
g = Graph().parse("lecture_resources/lecture_02_dwellings.ttl")
# query it
q = """
    PREFIX ex: <http://example.com/>
    PREFIX time: <http://www.w3.org/2006/time#>

    SELECT ?mb (AVG(?people) AS ?avg_people)
    WHERE {
        ?obs ex:hasNoPeople ?people ;
             time:hasTime ex:year-2021 ;
             ex:containingMB ?mb .
    }
    GROUP BY ?mb
    """
for r in g.query(q):
    print(f"{r[0].split('/')[-1]}, {r[1]}")  # should be the average we have seen before

50055290000, 2.5
50049040000, 3.5


This SPARQL query gives the same results as the SQL query and obtains them in approximately the same way: by filtering the results according to a required year (2021) and grouping them by Mesh Block. However, the result gives universally unique identifiers for the Mesh Blocks, and we are filtering not by a year value, "2021", but by association to a year object, `ex:year-2021`. Thus we won't get confused by Mesh Blocks from different versions of the ASGS dataset and we can look them up (use the web link) to get more information about them, perhaps area. We can also use Mesh Block spatial and year temporal relations.

For example, this next query reproduces the result for Mesh Block 50049040000 but uses spatio-temporal typology to get there, rather than specific values, i.e. results for _any year after 2016_, rather than _year = 2021_, and _any Mesh Block that touches MB 50055290000_, rather than a Mesh Block named by its ID:

In [24]:
q2 = """
    PREFIX ex: <http://example.com/>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>
    PREFIX mb: <https://linked.data.gov.au/dataset/asgsed3/MB/>
    PREFIX time: <http://www.w3.org/2006/time#>

    SELECT ?mb (AVG(?people) AS ?avg_people)
    WHERE {
        ?obs ex:hasNoPeople ?people ;
             time:hasTime ?year ;
             ex:containingMB ?mb .

        ?year time:after ex:year-2016 .
        ?mb geo:sfTouches mb:50055290000 .
    }
    GROUP BY ?mb
    """
for r in g.query(q2):
    print(f"{r[0].split('/')[-1]}, {float(r[1]):.1f}")  # same average as before, for the touching Mesh Block only

50049040000, 3.5


### 7.4 SPARQL and dataframes

As with relational DB data accessed via SQL, we can easily export data from SPARQL queries into PANDAS dataframes.

~There is no general SPARQL&rarr;dataframe (yet!) but we can just translate to CSV like this:~ - statement from 2021

We now have SPARQLWrapper's `sparql_dataframe` module: https://github.com/RDFLib/sparqlwrapper/blob/master/SPARQLWrapper/sparql_dataframe.py

...but we can also just translate to CSV like this:

In [25]:
q = """
    PREFIX ex: <http://example.com/>
    PREFIX time: <http://www.w3.org/2006/time#>

    SELECT ?mb ?people
    WHERE {
        ?obs
            ex:hasNoPeople ?people ;
            ex:containingMB ?mb ;
            time:hasTime ex:year-2021 ;
        .
    }
    """
csv = "MeshBlock,NoPeople\n"

# use the query above
for r in g.query(q):
    csv += f"{r['mb']},{r['people']}\n"

import pandas as pd
from io import StringIO
sparql_df = pd.read_csv(StringIO(csv))

print(sparql_df.head())

                                           MeshBlock  NoPeople
0  https://linked.data.gov.au/dataset/asgsed3/MB/...         3
1  https://linked.data.gov.au/dataset/asgsed3/MB/...         4
2  https://linked.data.gov.au/dataset/asgsed3/MB/...         1
3  https://linked.data.gov.au/dataset/asgsed3/MB/...         2
4  https://linked.data.gov.au/dataset/asgsed3/MB/...         2


<a id="sec-8"></a>
## 8. Conclusions / Suggested Actions

Rather than conclude with statements, I'm going to make suggestions.

* **learn about typologies relevant to your domain**
    * knowledge of these assists you in considering new possibilities
    * spatial, temporal, provenance, governance & partiness (how data is bundled into datasets) are all of quite universal interest and all have their own special relationships
* **learn a bit about relational databases**
    * if they are new to you, you may learn about multi-dimensional handling in SQL
    * consider using SQLite as you can use it easily with Python and thus with PANDAS, Matplotlib and so on
* **explore the RDFLib toolkit**
    * since RDFLib is in Python and works well with PANDAS etc., it's a natural place to start using Semantic Web data for data science
    * https://rdflib.readthedocs.io
* **find ready-to-go Semantic data**
    * some starting points:
        * ASGS: https://linked.data.gov.au/dataset/asgsed3
        * GNAF: https://linked.data.gov.au/dataset/gnaf
        * Geofabric: https://linked.data.gov.au/dataset/geofabric
        * Indigenous Spatial Data: https://data.idnau.org/s
        * Australian Government Linked Data Working Group: https://www.linked.data.gov.au/
* **ask the Government for more Semantic Web data**
    * I think Semantic Web data is the best form of data for utility and transparency
    * the Government claims to make data as available as possible, so it should be in Semantic Web form
    * try hassling the National Data Commissioner: https://www.datacommissioner.gov.au

<a id="sec-9"></a>
## 9. References

* **[A83]**: Allen, James F. "Maintaining knowledge about temporal intervals". Communications of the ACM 26(11) pp.832-843, Nov. 1983.
* **[DEM9]**: Clementini E., Di Felice P., van Oosterom P. (1993) "A small set of formal topological relationships suitable for end-user interaction". In: Abel D., Chin Ooi B. (eds) Advances in Spatial Databases. SSD 1993. Lecture Notes in Computer Science, vol 692. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56869-7_16
* **[GSP]**: Open Geospatial Consortium "OGC GeoSPARQL - A Geographic Query Language for RDF Data". Implementation standard (draft). https://opengeospatial.github.io/ogc-geosparql/geosparql11/spec.html
* **[IS1]**: International Organization for Standardization "ISO 19125-1:2004 Geographic information -- Simple feature access -- Part 1: Common architecture". International standard.
* **[IS2]**: International Organization for Standardization "ISO 19125-2:2004 Geographic information -- Simple feature access -- Part 2: SQL option". International standard.
* **[TIM]**: Cox, Simon & Little, Chris (eds.) "Time Ontology in OWL". W3C Candidate Recommendation 26 March 2020. https://www.w3.org/TR/owl-time/
* **[TMF]**: Car, N.J. 2021. "RDFlib OWL TIME Functions". Software library online. https://github.com/rdflib/timefuncs