## Guest Lecture COMP7230
# Using Python packages for spatial Linked Data
#### by Dr Nicholas Car

This Notebook is the resource used to deliver a guest lecture for the [Australian National University](https://www.anu.edu.au)'s course [COMP7230](https://programsandcourses.anu.edu.au/2020/course/COMP7230): *Introduction to Programming for Data Scientists*

Click here to run this lecture in your web browser:
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nicholascar/comp7230-training/HEAD?filepath=lecture_01_2022.ipynb)

## About the lecturer
**Nicholas Car**:
* PhD in informatics for irrigation
* A former CSIRO informatics researcher
    * worked on integrating environmental data across government / industry
    * developed data standards
* Has worked in operational IT in government
* Now runs a private IT consulting company, [Kurrawong AI](https://kurrawong.net), supplying Data Science solutions

Relevant current work:

* building data processing systems for government & industry
* mainly using Python
    * due to its large number of web and data science packages
* maintains the [RDFlib](https://rdflib.net) Python toolkit
    * for processing [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework)
* co-chairs the [Australian Government Linked Data Working Group](https://www.linked.data.gov.au)
    * plans for multi-agency data integration
* still developing data standards
    * in particular [GeoSPARQL 1.1](https://opengeospatial.github.io/ogc-geosparql/geosparql11/spec.html)
        * for graph representations of spatial information


## 0. Lecture Outline
1. Notes about this training material
2. Background Concepts
3. Accessing RDF data
4. Parsing RDF data
5. Data 'mash up'
6. Spatial Data Conversions & Display


## 1. Notes about this training material
* This is a Jupyter Notebook - interactive Python scripting
* You will cover Jupyter Notebooks more, later in this course
* Access this material online at:
    * GitHub: <https://github.com/nicholascar/comp7230-training>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nicholascar/comp7230-training/?filepath=lecture_01.ipynb)

## 2. Background Concepts
### 2.1 Knowledge Graphs & RDF
_Nick will talk about RDF using these web pages:_

* [Semantic Web](https://www.w3.org/standards/semanticweb/) - the concept
* [Knowledge Graph](https://en.wikipedia.org/wiki/Knowledge_graph)
    * IBM's version: <https://www.ibm.com/cloud/learn/knowledge-graph>
* [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework) - the data model
    * refer to the RDF image below
* [RDFlib](https://rdflib.net) - the (Python) toolkit
* [RDFlib training Notebooks are available](https://github.com/nicholascar/rdflib-training)

RDF image, from [the RDF Primer](https://www.w3.org/TR/rdf11-primer/), for discussion:

![](./lecture_resources/img/example-graph-iris.jpg)

### 2.2 Australian national spatial datasets in RDF

![LocI Logo](./lecture_resources/img/LocI.png)

The LocI & FSDF DAA projects:
* The Location Index project: <https://www.ga.gov.au/locationindex>
* FSDF DAA's "Supermodel"
    * <https://geoscienceaustralia.github.io/fsdf-supermodel/supermodel.html>
* Knowledge Graph spatial data: [GeoSPARQL](https://opengeospatial.github.io/ogc-geosparql/geosparql11/spec.html#_core)
* Operational APIs:
    * ASGS: <https://asgs.linked.fsdf.org.au/dataset/asgsed3/collections>
    * GNAF: <https://gnaf.linked.fsdf.org.au/dataset/gnaf/collections>
    * there are others too!

### 2.3 KG summary
* _everything_ is "strongly" identified
    * including all relationships
    * unlike lots of related data
* many of the identifiers resolve
    * to more info on the web
* KG spatial data looks a lot like regular spatial data
    * but it's connected to other things in a defined way

## 3. Accessing RDF data
* Here we use the API for the Geocoded National Address File (GNAF) for Australia to get Address data
    * Addresses Collection: <https://gnaf.linked.fsdf.org.au/dataset/gnaf/collections/address>
* GNAF-LD Data is presented according to *Linked Data* principles
    * online
    * in HTML & machine-readable form, RDF
    * RDF is a Knowledge Graph: a graph containing data + model
    * each resource is available via an IRI
        * e.g. <https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933>

![GAACT714845933](./lecture_resources/img/GAACT714845933.png)

### 3.1. Use local RDF data
Some setup imports

In [None]:
import httpx
import rdflib
from rdflib.namespace import DCTERMS, GEO, RDF, RDFS

RDF can be stored in files in multiple formats optimised for different purposes.

A commonly-used format is JSON-LD - a JSON encoding of RDF. Let's print a JSON-LD data file for the address GAACT714845933. We will parse this kind of data into an in-memory graph and count its triples in Section 4.

In [None]:
# a context manager closes the file once it has been read
with open("./lecture_resources/GAACT714845933.json-ld") as f:
    print(f.read())

### 3.2. Get Address GAACT714845933 data online using the *httpx* package

In [None]:
r = httpx.get(
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933",
    follow_redirects=True
)
print(r.text.strip())

Not so easy to use the HTML we got above!

### 3.3 Get machine-readable data, RDF in JSON-LD
Use HTTP Content Negotiation to get the same JSON-LD as stored locally: same IRI, different *format* of data.

In [None]:
r = httpx.get(
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933",
    headers={"Accept": "application/ld+json"},
    follow_redirects=True
)
print(r.text)

Let's get a different RDF format...

### 3.4 Get machine-readable data, Turtle. Easier to read

In [None]:
r = httpx.get(
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933",
    headers={"Accept": "text/turtle"},
    follow_redirects=True
)
print(r.text)
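The requests above ask for a media type in the `Accept` header, and the parsing cells later name a matching RDFLib `format`. A minimal sketch of that correspondence follows. The helper name `rdflib_format_for` and the lookup table are illustrative, not part of the lecture code or of RDFLib; only the format names (`turtle`, `json-ld`, `xml`, `nt`) are RDFLib's own.

```python
# Map RDF media types (as used in HTTP Accept / Content-Type headers)
# to the parser format names that RDFLib understands.
MEDIA_TYPE_TO_RDFLIB_FORMAT = {
    "text/turtle": "turtle",
    "application/ld+json": "json-ld",
    "application/rdf+xml": "xml",
    "application/n-triples": "nt",
}


def rdflib_format_for(content_type):
    """Return the RDFLib format name for a Content-Type value, or None."""
    # Drop parameters such as "; charset=utf-8" before the lookup
    media_type = content_type.split(";")[0].strip().lower()
    return MEDIA_TYPE_TO_RDFLIB_FORMAT.get(media_type)


print(rdflib_format_for("text/turtle; charset=utf-8"))
print(rdflib_format_for("text/html"))
```

A `None` result is a useful signal that the server ignored the `Accept` header and returned something that is not RDF, such as the HTML page from 3.2.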

## 4. Parsing RDF data

### 4.1 Using RDF tools - RDFLib
We use the RDFLib library, imported in Section 3.1, to manipulate RDF data

Add some namespaces to shorten IRIs

In [None]:
ADDR = rdflib.Namespace("http://w3id.org/profile/anz-address/")
print(GEO)

Create a graph and add the namespaces to it

In [None]:
g = rdflib.Graph(bind_namespaces="core") # RDF & RDFS added
g.bind("addr", ADDR)
g.bind("geo", GEO)
print(g)

Parse in the machine-readable data - JSON-LD RDF - from the GNAF online

In [None]:
r = httpx.get(
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933",
    headers={"Accept": "application/ld+json"},
    follow_redirects=True
)
g.parse(data=r.text, format="json-ld")
print(len(g))

> Why is this so much better than parsing ordinary CSV?
> What are the main KG advantages?
>
> * strong definitions
> * universal models
> * extensible models
> * tooling for instance parsing
> * easily queryable results

Print graph content, in a different format from that which we got - Turtle

In [None]:
print(g.serialize())

### 4.2 Getting multi-address data:
* 4.2.1. Retrieve a list of 20 addresses, in RDF
* 4.2.2. Get individual object data from a list
* 4.2.3. Get only the street address and map coordinates
* 4.2.4. Convert TSV data to a pandas DataFrame
* 4.2.5. SPARQL querying RDF data

* The GNAF has ~14.5M Addresses in it
* The Linked Data APIs we are using return data in pages: <http://gnaf.linked.fsdf.org.au/dataset/gnaf/collections/address/items?page=1>

#### 4.2.1. Retrieve a list of objects (1 page)

In [None]:
g = rdflib.Graph()

r = httpx.get(
    "http://gnaf.linked.fsdf.org.au/dataset/gnaf/collections/address/items?_profile=mem&_mediatype=text/turtle",
    headers={"Accept": "text/turtle"},
    follow_redirects=True
)
g.parse(data=r.text, format="turtle")  # name the format explicitly rather than rely on the default
print(len(g))

This list is also in RDF!

Show the IDs of the first few by looping through the graph

In [None]:
for i, t in enumerate(g.triples((None, RDFS.member, None))):
    print(f'Address {i+1}: {t[2].split("/")[-1]}')
    if i > 4:
        break
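The `None` values in `g.triples((None, RDFS.member, None))` act as wildcards: any subject, the given predicate, any object. The toy sketch below imitates that matching with plain Python tuples. It is not how RDFLib is implemented, and the `ex:` names are made up for illustration.

```python
# A toy, in-memory "graph": a set of (subject, predicate, object) tuples
triples = {
    ("ex:list", "rdfs:member", "ex:address/A1"),
    ("ex:list", "rdfs:member", "ex:address/A2"),
    ("ex:list", "rdfs:label", "Addresses"),
}


def match(graph, pattern):
    """Yield triples where every non-None pattern position is equal."""
    for triple in sorted(graph):
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


# Any subject, the rdfs:member predicate, any object
members = [o for _, _, o in match(triples, (None, "rdfs:member", None))]
print(members)
```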

#### 4.2.2. Get individual object data from a list

For each Address in a list, retrieve its RDF from the API online.

In [None]:
g = rdflib.Graph()
g.bind("addr", ADDR)
g.bind("geo", GEO)

addresses = [
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845944",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845934",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845943",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845949",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845955",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845935",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845947",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845950",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845953",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845945",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845946",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845939",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845941",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845942",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845954",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845952",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845938",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845936",
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845951",
]
for address in addresses:
    r = httpx.get(
        address,
        headers={"Accept": "text/turtle"},
        follow_redirects=True
    )
    g.parse(data=r.text, format="turtle")
    print(f"Getting {address}...")
    print(len(g))

Let's merge in some local data

In [None]:
print(f"Before merge, graph length: {len(g)}")
g.parse("./lecture_resources/address_geometries.ttl")
print(f"After merge, graph length: {len(g)}")

#### 4.2.3. Extract (& print) street address text & coordinates
As tab-separated values (TSV), a close relative of CSV...

In [None]:
addresses_tsv = "id\tcoordinates\n"
for s, p, o in g.triples((None, RDF.type, ADDR.Address)):
    id = g.value(s, DCTERMS.identifier)
    coords = ""
    for s2, p2, o2 in g.triples((s, ADDR.hasQualifiedGeometry, None)):
        for s3, p3, o3 in g.triples((o2, GEO.hasGeometry, None)):
            for s4, p4, o4 in g.triples((o3, GEO.asWKT, None)):
                coords = str(o4).strip()

    addresses_tsv += "{}\t{}\n".format(id, coords)

print(addresses_tsv)
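Building delimited text by string concatenation works here because the values contain no tabs or newlines. As a hedged alternative, Python's standard `csv` module writes the same layout and quotes awkward values when needed. The rows below are made-up stand-ins for the `(id, coordinates)` pairs extracted above.

```python
import csv
from io import StringIO

# Made-up rows standing in for the (id, coordinates) pairs extracted above
rows = [
    ("GAACT000000001", "POINT (149.10 -35.20)"),
    ("GAACT000000002", "POINT (149.20 -35.30)"),
]

buffer = StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["id", "coordinates"])  # header row
writer.writerows(rows)

demo_tsv = buffer.getvalue()
print(demo_tsv)
```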

#### 4.2.4. Convert TSV data to a pandas DataFrame

In [None]:
import pandas
from io import StringIO
s = StringIO(addresses_tsv)
df1 = pandas.read_csv(s, sep="\t")
print(df1)


#### 4.2.5. SPARQL querying RDF data
A graph query, similar to a database SQL query, can traverse the graph and retrieve the same details as the multiple loops and Python code above in 4.2.3.

In [None]:
q = """
PREFIX addr: <http://w3id.org/profile/anz-address/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?id ?coords
WHERE {
    ?iri dcterms:identifier ?id .

    ?iri addr:hasQualifiedGeometry/geo:hasGeometry/geo:asWKT ?coords .
}
ORDER BY ?id
"""
for r in g.query(q):
    print("{}, {}".format(r["id"], r["coords"]))

The query above uses a SPARQL *property path*, a chain of properties followed in sequence: `addr:hasQualifiedGeometry/geo:hasGeometry/geo:asWKT`

## 5. Data 'mash up'
Add some fake data to the GNAF data - people count per address.

The GeoSPARQL model extension used is:

![](./lecture_resources/img/geosparql-model-extension.png)

Note that for real Knowledge Graph work, the `xxx:` properties and classes would be "properly defined", removing any ambiguity of use.

In [None]:
import pandas
df2 = pandas.read_csv('./lecture_resources/fake_data.csv')
print(df2)

Merge DataFrames

In [None]:
df3 = pandas.merge(df1, df2)
print(df3.head())
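Called without extra arguments, `pandas.merge` performs an inner join on every column name the two frames share. A sketch of a more explicit merge follows, using made-up frames in place of `df1` and `df2`. That `id` is the shared key column is an assumption about `fake_data.csv`.

```python
import pandas

# Made-up frames standing in for df1 (addresses) and df2 (people counts)
left = pandas.DataFrame({
    "id": ["A1", "A2", "A3"],
    "coordinates": ["POINT (1 1)", "POINT (2 2)", "POINT (3 3)"],
})
right = pandas.DataFrame({"id": ["A1", "A2", "A4"], "persons": [3, 5, 2]})

# how="left" keeps every address; indicator=True adds a _merge column
# recording whether a matching row was found in the right-hand frame
merged = pandas.merge(left, right, on="id", how="left", indicator=True)
print(merged)
```

With an inner join, address `A3` would silently disappear; the left join keeps it with a missing `persons` value, which makes gaps in the second dataset visible.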

## 6. Spatial Data Conversions & Display

Often you will want to display or export data.

#### 6.1 Basic plot display
Using standard Python plotting (matplotlib).

First, extract longitudes & latitudes

In [None]:
import re
addresses_csv = "id,lon,lat\n"

q = """
    PREFIX addr: <http://w3id.org/profile/anz-address/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>
    SELECT ?id ?coords
    WHERE {
        ?iri dcterms:identifier ?id .

        ?iri addr:hasQualifiedGeometry/geo:hasGeometry/geo:asWKT ?coords .
    }
    ORDER BY ?id
    """
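for r in g.query(q):
    # raw string avoids invalid-escape warnings; the optional sign allows any hemisphere
    match = re.search(r"POINT\s*\((-?\d+\.\d+)\s+(-?\d+\.\d+)\)", r["coords"])
    if match is None:
        continue  # skip geometries that are not simple WKT points
    lon = float(match.group(1))
    lat = float(match.group(2))
    addresses_csv += f'"{r["id"]}",{lon},{lat}\n'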

print(addresses_csv)
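The coordinate extraction can also be wrapped in a small reusable function. This is a sketch, not a full WKT parser: it handles only `POINT` geometries, and `parse_wkt_point` is a made-up name. It tolerates an optional coordinate reference system IRI in front of the geometry, which GeoSPARQL WKT literals may carry.

```python
import re

# Optional sign, digits, optional decimal part
NUMBER = r"[-+]?\d+(?:\.\d+)?"
POINT_RE = re.compile(rf"POINT\s*\(\s*({NUMBER})\s+({NUMBER})\s*\)", re.IGNORECASE)


def parse_wkt_point(wkt):
    """Return the two coordinates of a WKT POINT as floats, or None if no match."""
    match = POINT_RE.search(wkt)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_wkt_point("POINT (149.19 -35.25)"))
print(parse_wkt_point("<http://www.opengis.net/def/crs/OGC/1.3/CRS84> POINT(149 -35)"))
print(parse_wkt_point("POLYGON ((0 0, 1 0, 1 1, 0 0))"))
```

The function returns the coordinates in the order they are written, which for the lecture data is longitude then latitude.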

Put this new CSV data into a dataframe

In [None]:
df4 = pandas.read_csv(StringIO(addresses_csv))
print(df4.head())

Merge in the persons data

In [None]:
df5 = pandas.merge(df4, df2)
print(df5.head())

Display

In [None]:
from matplotlib import pyplot as plt

df5.plot(kind="scatter", x="lon", y="lat", s=50, figsize=(10,10))

# iterate over rows; iterating the DataFrame itself would yield column names
for row in df5.itertuples():
    plt.annotate(str(row.persons), (row.lon, row.lat))
    
plt.show()


#### 6.2 Better map display

Just use a toolkit - MapBox!

First, convert data format to GeoJSON

In [None]:
import json

# one GeoJSON Feature per address row
features = []
for index, row in df5.iterrows():
    features.append({
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [row["lon"], row["lat"]]
        },
        "properties": {
            "id": row["id"],
            "persons": row["persons"]
        }
    })
addresses_geojson = {
    "type": "FeatureCollection",
    "features": features
}
print(json.dumps(addresses_geojson, indent=4))

In [None]:
# the public MapBox token
token = "pk.eyJ1IjoibmljaG9sYXNjYXIiLCJhIjoiY2w3aWFkbXp2MDdrZjN2czMwMmYydmkwZiJ9.o-BIM9Fktde7bjgWZ8Ti5A"

from mapboxgl.utils import create_color_stops
from mapboxgl.viz import CircleViz

viz = CircleViz(addresses_geojson,
                access_token=token,
                height='500px',
                label_property='id',
                color_property='persons',
                color_default='grey',
                color_function_type='match',
                color_stops=create_color_stops([0, 2, 4, 6, 8], colors='YlOrRd'),
                radius=2,
                center=(149.19, -35.25),
                zoom=10)
viz.show()

## Concluding remarks

* Knowledge Graphs, realised through Linked Data, build a global machine-readable data system - the Semantic Web
* the RDF data structure is used
    * to link things
    * to define things, and the links
* specialised parts of the Sem Web can represent any domain
    * e.g. spatial
    * e.g. Addresses
* powerful graph pattern matching queries, SPARQL, can be used to subset (federated) Sem Web data
* RDF manipulation libraries exist
    * can convert to other, common forms, e.g. CSV, GeoJSON
* _do as much data science work as you can with well-defined models!_

## License
All the content in this repository is licensed under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/). Basically, you can:

* copy and redistribute the material in any medium or format
* remix, transform, and build upon the material for any purpose, even commercially

You just need to:

* give appropriate credit, provide a link to the license, and indicate if changes were made
* not apply legal terms or technological measures that legally restrict others from doing anything the license permits

## Contact Information
**Dr Nicholas J. Car**<br />
*Data Systems Architect*<br />
[SURROUND Australia Pty Ltd](https://surroundaustralia.com)<br />
<nicholas.car@surroundaustralia.com><br />
GitHub: [nicholascar](https://github.com/nicholascar)<br />
ORCID: <https://orcid.org/0000-0002-8742-7730><br />