## Guest Lecture COMP7230
# Using Python packages for Linked Data & spatial data
#### by Dr Nicholas Car

This Notebook is the resource used to deliver a guest lecture for the [Australian National University](https://www.anu.edu.au)'s course [COMP7230](https://programsandcourses.anu.edu.au/2020/course/COMP7230): *Introduction to Programming for Data Scientists*.

Click here to run this lecture in your web browser:
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nicholascar/comp7230-training/HEAD?filepath=lecture.ipynb)

## About the lecturer
**Nicholas Car**:
* PhD in informatics for irrigation
* A former CSIRO informatics researcher
    * worked on integrating environmental data across government / industry
    * developed data standards
* Has worked in operational IT in government
* Now in a private IT consulting company, [SURROUND Australia Pty Ltd](https://surroundaustralia.com) supplying Data Science solutions

Relevant current work:

* building data processing systems for government & industry
* mainly using Python
    * due to its large number of web and data science packages
* maintains the [RDFlib](https://rdflib.net) Python toolkit
    * for processing [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework)
* co-chairs the [Australian Government Linked Data Working Group](https://www.linked.data.gov.au) with Armin Haller
    * plans for multi-agency data integration
* still developing data standards
    * in particular [GeoSPARQL 1.1](https://opengeospatial.github.io/ogc-geosparql/geosparql11/spec.html)
        * for graph representations of spatial information


## 0. Lecture Outline
1. Notes about this training material
2. Accessing RDF data
3. Parsing RDF data
4. Data 'mash up'
5. Spatial Data Conversions & Display


## 1. Notes about this training material

#### This tool
* This is a Jupyter Notebook - interactive Python scripting
* You will cover Jupyter Notebooks more, later in this course
* Access this material online at:
    * GitHub: <https://github.com/nicholascar/comp7230-training>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nicholascar/comp7230-training/?filepath=lecture.ipynb)

#### Background data concepts - RDF

_Nick will talk about RDF using these web pages:_

* [Semantic Web](https://www.w3.org/standards/semanticweb/) - the concept
* [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework) - the data model
    * refer to the RDF image below
* [RDFlib](https://rdflib.net) - the (Python) toolkit
* [RDFlib training Notebooks are available](https://github.com/surroundaustralia/rdflib-training)

The LocI project:
* The Location Index project: <http://loci.cat>

RDF image, from [the RDF Primer](https://www.w3.org/TR/rdf11-primer/), for discussion:

![](img/example-graph-iris.png)

Note that:
* _everything_ is "strongly" identified
    * including all relationships
    * unlike lots of related data
* many of the identifiers resolve
    * to more info (on the web)

## 2. Accessing RDF data

* Here we use an online structured dataset, the Geocoded National Address File for Australia
    * Dataset Persistent Identifier: <https://linked.data.gov.au/dataset/gnaf>
    * The above link redirects to the API at <https://gnafld.net>
* GNAF-LD Data is presented according to *Linked Data* principles
    * online
    * in HTML & machine-readable form, RDF
    * RDF is a Knowledge Graph: a graph containing data + model
    * each resource is available via a URI
        * e.g. <https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933>

![GAACT714845933](img/GAACT714845933.png)


#### 2.1 Get the Address GAACT714845933 using the *requests* package

In [None]:
import requests  # NOTE: you must have installed requests first, it's not a standard package
r = requests.get(
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933"
)
print(r.text)

#### 2.2 Get machine-readable data, RDF triples
* use HTTP Content Negotiation
* same URI, different *format* of data

In [None]:
r = requests.get(
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933",
    headers={"Accept": "application/n-triples"}
)
print(r.text)
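
Content negotiation comes down to one request header. The sketch below builds, but does not send, requests for three representations of the same URI. It uses the standard library's `urllib` rather than *requests* so that nothing goes over the network, and `build_request` is a hypothetical helper name, not part of either library.

```python
from urllib.request import Request

ADDRESS_URI = "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933"


def build_request(uri, media_type):
    """Build (but do not send) a GET request asking for one media type."""
    return Request(uri, headers={"Accept": media_type})


# Same URI, three different requested representations
for media_type in ("text/html", "application/n-triples", "text/turtle"):
    req = build_request(ADDRESS_URI, media_type)
    print(req.full_url, "->", req.get_header("Accept"))
```

The server is free to ignore the header, so checking the `Content-Type` of the response is a sensible habit.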

#### 2.3 Get machine-readable data, Turtle
* easier to read

In [None]:
r = requests.get(
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933",
    headers={"Accept": "text/turtle"}
)
print(r.text)

## 3. Parsing RDF data

Import the RDFlib library for manipulating RDF data and add some namespaces to shorten URIs.

In [None]:
import rdflib
from rdflib.namespace import RDF, RDFS
GNAF = rdflib.Namespace("http://linked.data.gov.au/def/gnaf#")
ADDR = rdflib.Namespace("http://linked.data.gov.au/dataset/gnaf/address/")
GEO = rdflib.Namespace("http://www.opengis.net/ont/geosparql#")
print(GEO)
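
To illustrate what a namespace buys you, here is a plain-dictionary sketch of prefix expansion. The `expand` helper is hypothetical and is not RDFlib's API; it only shows how a short name such as `gnaf:Address` maps back to a full URI.

```python
# Prefix table mirroring the namespaces declared above
PREFIXES = {
    "gnaf": "http://linked.data.gov.au/def/gnaf#",
    "addr": "http://linked.data.gov.au/dataset/gnaf/address/",
    "geo": "http://www.opengis.net/ont/geosparql#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URI (hypothetical helper)."""
    prefix, local_name = prefixed_name.split(":", 1)
    return prefixes[prefix] + local_name


print(expand("gnaf:Address"))
print(expand("geo:asWKT"))
```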

Create a graph and add the namespaces to it

In [None]:
g = rdflib.Graph()
g.bind("gnaf", GNAF)
g.bind("addr", ADDR)
g.bind("geo", GEO)

Parse in the machine-readable data from the GNAF-LD

In [None]:
r = requests.get(
    "https://linked.data.gov.au/dataset/gnaf/address/GAACT714845933",
    headers={"Accept": "text/turtle"}
)
g.parse(data=r.text, format="text/turtle")

Print graph length (no. of triples) to check

In [None]:
print(len(g))

Print graph content, in Turtle

In [None]:
ttl = g.serialize(format="turtle")  # bytes in rdflib < 6, str in rdflib >= 6
print(ttl.decode() if isinstance(ttl, bytes) else ttl)

### 3.1 Getting multi-address data:
* 3.1.1. Retrieve an index of 10 addresses, in RDF
    * use the paging URI: <https://linked.data.gov.au/dataset/gnaf/address/?page=1>
* 3.1.2. For each address in the index, get that Address' data
* 3.1.3. Get only the street address and map coordinates

#### 3.1.1. Retrieve index

In [None]:
# clear the graph
g = rdflib.Graph()

r = requests.get(
    "https://linked.data.gov.au/dataset/gnaf/address/?page=1",
    headers={"Accept": "text/turtle"}
)
g.parse(data=r.text, format="text/turtle")
print(len(g))
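
If more than one page of the index were needed, the paging URIs could be generated rather than typed. This is a sketch only: the source shows just `?page=1`, so the existence of further pages is an assumption, and `page_uri` is a hypothetical helper.

```python
from urllib.parse import urlencode

INDEX_URI = "https://linked.data.gov.au/dataset/gnaf/address/"


def page_uri(page):
    """Build the index URI for a given page number (hypothetical helper)."""
    return INDEX_URI + "?" + urlencode({"page": page})


# Assumption: pages beyond the first exist and follow the same pattern
for page in range(1, 4):
    print(page_uri(page))
```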

#### 3.1.2. Parse in each address' data

In [None]:
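# snapshot the address URIs first, so the graph is not modified while we iterate over it
address_uris = list(g.subjects(RDF.type, GNAF.Address))
for s in address_uris:
    print(s.split("/")[-1])
    r = requests.get(str(s), headers={"Accept": "text/turtle"})
    g.parse(data=r.text, format="turtle")
    print(len(g))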

The graph model used by the GNAF-LD is based on [GeoSPARQL 1.1](https://opengeospatial.github.io/ogc-geosparql/geosparql11/spec.html) and looks like this:

![](img/geosparql-model.png)
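
Reading coordinates out of this model means following two links: Address to Geometry, then Geometry to WKT literal. The sketch below shows that two-hop traversal over plain tuples. The identifiers `addr:A1` and `_:geom1` and the coordinates are made up, and the `objects` and `wkt_for` helpers are hypothetical.

```python
GEO = "http://www.opengis.net/ont/geosparql#"

# Hypothetical triples following the Address -> Geometry -> WKT pattern
triples = [
    ("addr:A1", GEO + "hasGeometry", "_:geom1"),
    ("_:geom1", GEO + "asWKT", "POINT(149.0 -35.0)"),  # made-up coordinates
]


def objects(subject, predicate):
    """Yield every object of triples matching (subject, predicate, ?)."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            yield o


def wkt_for(address):
    """Follow hasGeometry, then asWKT: two hops through the graph."""
    return [
        wkt
        for geom in objects(address, GEO + "hasGeometry")
        for wkt in objects(geom, GEO + "asWKT")
    ]


print(wkt_for("addr:A1"))
```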

#### 3.1.3. Extract (& print) street address text & coordinates
(tab-separated values, TSV)

In [None]:
addresses_tsv = "GNAF ID\tAddress\tCoordinates\n"
for s, p, o in g.triples((None, RDF.type, GNAF.Address)):
    # reset per address so a missing value is not carried over from the previous one
    txt, coords = "", ""
    for s2, p2, o2 in g.triples((s, RDFS.comment, None)):
        txt = str(o2)
    for s2, p2, o2 in g.triples((s, GEO.hasGeometry, None)):
        for s3, p3, o3 in g.triples((o2, GEO.asWKT, None)):
            # strip the CRS URI prefix, leaving only the WKT point
            coords = str(o3).replace("<http://www.opengis.net/def/crs/EPSG/0/4283> ", "")

    addresses_tsv += "{}\t{}\t{}\n".format(str(s).split("/")[-1], txt, coords)

print(addresses_tsv)
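
Building delimited text by string concatenation works until a value contains the delimiter. As a hedged alternative, the standard library's `csv` module can write the same tab-separated layout and quote awkward values. The rows below are hypothetical stand-ins, not GNAF data.

```python
import csv
from io import StringIO

# Hypothetical rows standing in for (GNAF ID, address text, WKT coordinates)
rows = [
    ("ID-1", "1 Example Street, Sampletown", "POINT(149.0 -35.0)"),
    ("ID-2", "Unit 2\t3 Tabbed Lane", "POINT(149.1 -35.1)"),  # embedded tab
]

buffer = StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["GNAF ID", "Address", "Coordinates"])
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)

# Reading it back recovers the original values, embedded tab included
parsed = list(csv.reader(StringIO(tsv_text), delimiter="\t"))
```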

#### 3.1.4. Convert TSV data to a pandas DataFrame
(TSV)

In [None]:
import pandas
from io import StringIO
s = StringIO(addresses_tsv)
df1 = pandas.read_csv(s, sep="\t")
print(df1)


#### 3.1.5. SPARQL querying RDF data
A graph query, similar to a database SQL query, can traverse the graph and retrieve the same details as the multiple
loops and Python code above in 3.1.3.

In [None]:
q = """
SELECT ?id ?addr ?coords
WHERE {
    ?uri a gnaf:Address ;
         rdfs:comment ?addr .

    ?uri geo:hasGeometry/geo:asWKT ?coords_dirty .

    BIND (STRAFTER(STR(?uri), "address/") AS ?id)
    BIND (STRAFTER(STR(?coords_dirty), "4283> ") AS ?coords)
}
ORDER BY ?id
"""
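# pass the prefixes explicitly: the graph was re-created in 3.1.1 without the earlier bindings
for r in g.query(q, initNs={"gnaf": GNAF, "geo": GEO, "rdfs": RDFS}):
    print("{}, {}, {}".format(r["id"], r["addr"], r["coords"]))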

## 4. Data 'mash up'
Add some fake data to the GNAF data - people count per address.

The GeoSPARQL model extension used is:

![](img/geosparql-model-extension.png)

Note that for real Semantic Web work, the `xxx:` properties and classes would be "properly defined", removing any ambiguity of use.

In [1]:
import pandas
df2 = pandas.read_csv('fake_data.csv')
print(df2)

           GNAF ID   Persons
0   GAACT714845944         3
1   GAACT714845934         5
2   GAACT714845943        10
3   GAACT714845949         1
4   GAACT714845955         2
5   GAACT714845935         1
6   GAACT714845947         4
7   GAACT714845950         3
8   GAACT714845933         4
9   GAACT714845953         2
10  GAACT714845945         3
11  GAACT714845946         3
12  GAACT714845939         4
13  GAACT714845941         2
14  GAACT714845942         1
15  GAACT714845954         0
16  GAACT714845952         5
17  GAACT714845938         3
18  GAACT714845936         4
19  GAACT714845951         3


Merge DataFrames

In [None]:
df3 = pandas.merge(df1, df2)
print(df3.head())
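
With no other arguments, `pandas.merge` performs an inner join on the columns the two DataFrames share, here `GNAF ID`. The sketch below shows conceptually what that means, using plain dictionaries. The records are hypothetical, `inner_join` is an illustrative helper rather than how pandas is implemented, and it assumes each key appears at most once on the right-hand side.

```python
# Hypothetical records standing in for rows of df1 and df2
addresses = [
    {"GNAF ID": "ID-1", "Address": "1 Example Street"},
    {"GNAF ID": "ID-2", "Address": "2 Example Street"},
]
persons = [
    {"GNAF ID": "ID-2", "Persons": 5},
    {"GNAF ID": "ID-3", "Persons": 1},
]


def inner_join(left, right, key):
    """Keep only rows whose key value appears on both sides."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


merged = inner_join(addresses, persons, "GNAF ID")
print(merged)
```

Rows without a partner on the other side are dropped, which is why the merged result can be shorter than either input.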

## 5. Spatial Data Conversions & Display

Often you will want to display or export data.

#### 5.1 Display directly in Jupyter
Using standard Python plotting (matplotlib).

First, extract addresses, longitudes & latitudes into a dataframe using a SPARQL query to build a CSV string.

In [None]:
import re
addresses_csv = "Address,Longitude,Latitude\n"

q = """
    SELECT ?addr ?coords
    WHERE {
        ?uri a gnaf:Address ;
             rdfs:comment ?addr .

        ?uri geo:hasGeometry/geo:asWKT ?coords_dirty .

        BIND (STRAFTER(STR(?uri), "address/") AS ?id)
        BIND (STRAFTER(STR(?coords_dirty), "4283> ") AS ?coords)
    }
    ORDER BY ?id
    """
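for r in g.query(q, initNs={"gnaf": GNAF, "geo": GEO, "rdfs": RDFS}):
    # raw string, so the regex backslashes are not treated as string escapes
    match = re.search(r"POINT\((\d+\.\d+)\s(\-\d+\.\d+)\)", r["coords"])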
    long = float(match.group(1))
    lat = float(match.group(2))
    addresses_csv += f'\"{r["addr"]}\",{long},{lat}\n'

print(addresses_csv)
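
The regular expression above assumes a positive decimal longitude and a negative decimal latitude, which holds for Australian addresses. A slightly more tolerant sketch is shown here: it accepts an optional CRS URI prefix, optional signs, and integer values. `parse_wkt_point` is a hypothetical helper and the coordinates are made up.

```python
import re

# Optional "<crs-uri> " prefix, then POINT(lon lat) with optional signs and decimals
WKT_POINT = re.compile(
    r"^(?:<(?P<crs>[^>]+)>\s*)?"
    r"POINT\s*\(\s*(?P<lon>-?\d+(?:\.\d+)?)\s+(?P<lat>-?\d+(?:\.\d+)?)\s*\)$"
)


def parse_wkt_point(literal):
    """Return (crs, lon, lat), or None if the text is not a simple 2D point."""
    match = WKT_POINT.match(literal.strip())
    if match is None:
        return None
    return match["crs"], float(match["lon"]), float(match["lat"])


# Made-up coordinates, for illustration only
print(parse_wkt_point("<http://www.opengis.net/def/crs/EPSG/0/4283> POINT(149.0 -35.5)"))
print(parse_wkt_point("POINT(149 -35)"))
```

Returning `None` for anything unexpected lets the caller decide whether to skip the row or stop.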

Read the CSV into a DataFrame.

In [None]:
import pandas as pd
from io import StringIO
addresses_df = pd.read_csv(StringIO(addresses_csv))

print(addresses_df["Longitude"])

Display the first 5 rows of the DataFrame directly using matplotlib.

In [None]:
from matplotlib import pyplot as plt
addresses_df[:5].plot(kind="scatter", x="Longitude", y="Latitude", s=50, figsize=(10,10))

# iterate over the rows (iterating a DataFrame directly yields its column names)
for i, row in addresses_df[:5].iterrows():
    plt.annotate(row["Address"], (row["Longitude"], row["Latitude"]))
    
plt.show()

#### 5.2 Convert to common format - GeoJSON

Import Python conversion tools (shapely).

In [None]:
import shapely.wkt
from shapely.geometry import MultiPoint
import json

Loop through the graph using ordinary Python loops, not a query.

In [None]:
points_list = []

for s, p, o in g.triples((None, RDF.type, GNAF.Address)):
    for s2, p2, o2 in g.triples((s, GEO.hasGeometry, None)):
        for s3, p3, o3 in g.triples((o2, GEO.asWKT, None)):
            points_list.append(
                shapely.wkt.loads(str(o3).replace("<http://www.opengis.net/def/crs/EPSG/0/4283> ", ""))
            )

mp = MultiPoint(points=points_list)

geojson = shapely.geometry.mapping(mp)
print(json.dumps(geojson, indent=4))

Another, better, GeoJSON export - including Feature information.

First, build a Python dictionary matching the GeoJSON specification, then export it to JSON.

In [None]:
geo_json_features = []

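# same query as above
for r in g.query(q, initNs={"gnaf": GNAF, "geo": GEO, "rdfs": RDFS}):
    # raw string, so the regex backslashes are not treated as string escapes
    match = re.search(r"POINT\((\d+\.\d+)\s(\-\d+\.\d+)\)", r["coords"])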
    long = float(match.group(1))
    lat = float(match.group(2))
    geo_json_features.append({
        "type": "Feature", 
        "properties": { "name": r["addr"] },
        "geometry": { 
            "type": "Point", 
            "coordinates": [ long, lat ] 
        } 
    })
    
geo_json_data = {
    "type": "FeatureCollection",
    "name": "test-points-short-named",
    "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
    "features": geo_json_features
}

import json
geo_json = json.dumps(geo_json_data, indent=4)
print(geo_json)

Export the data and view it in a GeoJSON map viewer, such as <http://geojsonviewer.nsspot.net/> or QGIS (desktop).
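
One way to export is to write the dictionary to a file and read it back as a check. This is a sketch under stated assumptions: `feature_collection` is a hypothetical stand-in with made-up coordinates, and the `addresses.geojson` file name and temporary-directory location are arbitrary choices.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the FeatureCollection built above
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "1 Example Street"},
            "geometry": {"type": "Point", "coordinates": [149.0, -35.0]},
        }
    ],
}

# Hypothetical output location; any writable path would do
out_path = os.path.join(tempfile.gettempdir(), "addresses.geojson")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(feature_collection, f, indent=4)

# Read it back to confirm the file holds valid JSON with the expected shape
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded["type"], len(reloaded["features"]))
```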

## Concluding remarks

* Semantic Web, realised through Linked Data, builds a global machine-readable data system
* the RDF data structure is used
    * to link things
    * to define things, and the links
* specialised parts of the Sem Web can represent a/any domain
    * e.g. spatial
    * e.g. Addresses
* powerful graph pattern matching queries, SPARQL, can be used to subset (federated) Sem Web data
* RDF manipulation libraries exist
    * can convert to other, common forms, e.g. CSV, GeoJSON
* _do as much data science work as you can with well-defined models!_

## License
All the content in this repository is licensed under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/). Basically, you can:

* copy and redistribute the material in any medium or format
* remix, transform, and build upon the material for any purpose, even commercially

You just need to:

* give appropriate credit, provide a link to the license, and indicate if changes were made
* not apply legal terms or technological measures that legally restrict others from doing anything the license permits

## Contact Information
**Dr Nicholas J. Car**<br />
*Data Systems Architect*<br />
[SURROUND Australia Pty Ltd](https://surroundaustralia.com)<br />
<nicholas.car@surroundaustralia.com><br />
GitHub: [nicholascar](https://github.com/nicholascar)<br />
ORCID: <https://orcid.org/0000-0002-8742-7730><br />