In [1]:
# DATA2001 Week 7 Tutorial Solutions
# Material last updated: 7 Apr 2025
# Note: this notebook was designed with the Roboto Condensed font, which can be installed here: https://www.1001fonts.com/roboto-condensed-font.html

from IPython.display import HTML
HTML('''
    <style> body {font-family: "Roboto Condensed Light", "Roboto Condensed";} h2 {padding: 10px 12px; background-color: #E64626; position: static; color: #ffffff; font-size: 40px;} .text_cell_render p { font-size: 15px; } .text_cell_render h1 { font-size: 30px; } h1 {padding: 10px 12px; background-color: #E64626; color: #ffffff; font-size: 40px;} .text_cell_render h3 { padding: 10px 12px; background-color: #0148A4; position: static; color: #ffffff; font-size: 20px;} h4:before{ 
    content: "@"; font-family:"Wingdings"; font-style:regular; margin-right: 4px;} .text_cell_render h4 {padding: 8px; font-family: "Roboto Condensed Light"; position: static; font-style: italic; background-color: #FFB800; color: #ffffff; font-size: 18px; text-align: center; border-radius: 5px;}input[type=submit] {background-color: #E64626; border: solid; border-color: #734036; color: white; padding: 8px 16px; text-decoration: none; margin: 4px 2px; cursor: pointer; border-radius: 20px;}</style>
''')

# Week 7 - Web APIs and Semi-Structured Data

This week we will go beyond scraping data from websites and use APIs to collect data efficiently. Web APIs are purposefully provided by vendors to allow formal access to data, meaning the data is often quite well-defined and consistent.

This tutorial offers an introduction to the potential that APIs offer, and hopefully helps you consider the possibilities of data integrations in your future projects. It will also continue to focus on **semi-structured data**, and how we can transform this into a tabular form.

Similarly to last week, our content will require the following Python libraries (none of which you should need to install):
- **Requests** for interacting with websites and web services
- **JSON** for handling JSON semi-structured objects
- **Pandas** for dataframe management

#### A note on the SQL Quiz

Much of this week's tutorial time will be spent conducting the **SQL Quiz assessment**. There won't be time to cover all the content below, but it is included for your reference so you can begin to apply hands-on examples of web APIs. The only "tasks" are towards the end and are SQL recap-based, rather than API tests. The accompanying text and descriptions, combined with the lecture, should be sufficient to read through and test out in your own time, and we will make solutions available as usual. Questions welcome on Ed, as always.

The hope is that tutorial time will allow for **Sections 1** (for an introduction), **and 2** (for the benefit of the group assignment), but there is no issue if this needs to be left for next week to unpack in detail. Sections 3 and 4 are simply expansions of working with APIs and semi-structured data, and may not be covered in-person in either this week or the next.

## 1. Introduction to JSON and Web APIs

Let's begin by exploring a few fun, simple examples of APIs with a variety of data types.

### 1.1 Exploring JSON with APIs

At its core, accessing an API is much like accessing a webpage, as we did in the Week 6 tutorial.

Accordingly, we'll import the `requests` library to start, but also a little extra to allow later display of images:

In [2]:
import requests
import ipywidgets as widgets
def display_image(response, w=200, h=300):
    return widgets.Image(value=response.content, format='jpg', width=w, height=h)

From there, let's test out the [Stanford Dogs Dataset](https://dog.ceo/dog-api/documentation/):

In [3]:
response = requests.get('https://dog.ceo/api/breeds/list/all')
response.text

'{"message":{"affenpinscher":[],"african":[],"airedale":[],"akita":[],"appenzeller":[],"australian":["kelpie","shepherd"],"bakharwal":["indian"],"basenji":[],"beagle":[],"bluetick":[],"borzoi":[],"bouvier":[],"boxer":[],"brabancon":[],"briard":[],"buhund":["norwegian"],"bulldog":["boston","english","french"],"bullterrier":["staffordshire"],"cattledog":["australian"],"cavapoo":[],"chihuahua":[],"chippiparai":["indian"],"chow":[],"clumber":[],"cockapoo":[],"collie":["border"],"coonhound":[],"corgi":["cardigan"],"cotondetulear":[],"dachshund":[],"dalmatian":[],"dane":["great"],"danish":["swedish"],"deerhound":["scottish"],"dhole":[],"dingo":[],"doberman":[],"elkhound":["norwegian"],"entlebucher":[],"eskimo":[],"finnish":["lapphund"],"frise":["bichon"],"gaddi":["indian"],"germanshepherd":[],"greyhound":["indian","italian"],"groenendael":[],"havanese":[],"hound":["afghan","basset","blood","english","ibizan","plott","walker"],"husky":[],"keeshond":[],"kelpie":[],"kombai":[],"komondor":[],"ku

This output appears to be on the right track, but is definitely tricky to interpret in simple text format! It is really a `JSON` object - a common form of semi-structured data that may be reminiscent of our HTML encounters in Week 6. We can read it in appropriately using the ``.loads()`` function in the JSON library, which can parse a valid string into a JSON object.

In [4]:
import json
breeds = json.loads(response.text)
breeds.keys()

dict_keys(['message', 'status'])

As seen by observing the keys above, or consulting the documentation, two things are returned - a "message", and the "status". We're interested in the "message". We can return output without further use of the JSON library, but it's worthwhile pointing out the ``.dumps()`` function, which can output JSON objects (or parts within) as simple strings, with desired formatting (e.g. indenting using `indent`, sorted keys using `sort_keys`, etc)
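
As a quick offline illustration of the round trip between strings and JSON objects, here is a minimal sketch. The `raw` string is hand-written to mirror the shape of the response above (it is not real API output), and the value of `"status"` is an assumption:

```python
import json

# Hand-written sample mirroring the shape of the API response (not real output).
raw = '{"message": {"bulldog": ["boston", "english", "french"], "akita": []}, "status": "success"}'

parsed = json.loads(raw)                   # string -> nested dicts and lists
sub_breeds = parsed["message"]["bulldog"]  # ordinary dictionary/list indexing

# indent pretty-prints the output; sort_keys orders the keys alphabetically.
pretty = json.dumps(parsed["message"], indent=2, sort_keys=True)
print(pretty)
```

Because `json.dumps()` returns a plain string, it can be passed back through `json.loads()` to recover the same object.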

In [5]:
print(json.dumps(breeds['message'], indent=4))

{
    "affenpinscher": [],
    "african": [],
    "airedale": [],
    "akita": [],
    "appenzeller": [],
    "australian": [
        "kelpie",
        "shepherd"
    ],
    "bakharwal": [
        "indian"
    ],
    "basenji": [],
    "beagle": [],
    "bluetick": [],
    "borzoi": [],
    "bouvier": [],
    "boxer": [],
    "brabancon": [],
    "briard": [],
    "buhund": [
        "norwegian"
    ],
    "bulldog": [
        "boston",
        "english",
        "french"
    ],
    "bullterrier": [
        "staffordshire"
    ],
    "cattledog": [
        "australian"
    ],
    "cavapoo": [],
    "chihuahua": [],
    "chippiparai": [
        "indian"
    ],
    "chow": [],
    "clumber": [],
    "cockapoo": [],
    "collie": [
        "border"
    ],
    "coonhound": [],
    "corgi": [
        "cardigan"
    ],
    "cotondetulear": [],
    "dachshund": [],
    "dalmatian": [],
    "dane": [
        "great"
    ],
    "danish": [
        "swedish"
    ],
    "deerhound": [
        "sc

The "[breed](https://dog.ceo/dog-api/documentation/breed)" page of the documentation defines the general format of a URL to return a random image of a selected dog breed. The function below leverages this to return a single image from the selected breed:

In [6]:
def random_dog(breed):
    # This request returns JSON whose "message" is the URL of a random image of the selected breed.
    # A second call to requests.get() is then needed to fetch the actual image.
    response = requests.get(f'https://dog.ceo/api/breed/{breed}/images/random')
    image_url = json.loads(response.text)['message']  # separate name, so the breed argument is not overwritten
    response_image = requests.get(image_url)
    return display_image(response_image)

Try choosing a breed from the JSON above, and returning an image for it below (currently returns a 'husky'):

In [7]:
random_dog('husky')

Image(value=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\x08\x06\x0…
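
If the breed name is misspelled, the "message" will not be an image URL. A defensive sketch is below: it checks the "status" field before trusting the "message". The two response bodies are hand-written, and the `"success"`/`"error"` values are assumptions about what the API sends, so check the documentation before relying on them. `image_url_from_body` is a hypothetical helper name.

```python
import json

def image_url_from_body(body_text):
    """Return the image URL from a Dog-API-style response body, or raise if the call failed."""
    payload = json.loads(body_text)
    if payload.get("status") != "success":
        raise ValueError(f"API reported a problem: {payload.get('message')}")
    return payload["message"]

# Hand-written example bodies (assumed shapes, not captured from the API).
ok_body = '{"message": "https://example.com/husky.jpg", "status": "success"}'
bad_body = '{"message": "Breed not found", "status": "error"}'

print(image_url_from_body(ok_body))
```

Calling `image_url_from_body(bad_body)` raises a `ValueError` rather than passing a non-URL on to the second request.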

### 1.2 Exploring XML Objects with APIs

We'll dig further into providing parameters to an API query in the next section, but we'll also quickly investigate other forms in which semi-structured data could be returned by an API. **XML** is another common format, that's more akin to HTML in its syntax.

We'll try calling a specific API reference from the [World Bank API](https://documents.worldbank.org/en/publication/documents-reports/api), which provides many free economic metrics, population among them. The URL below returns Australia's population over time, as represented in [their own webpages](https://data.worldbank.org/indicator/SP.POP.TOTL?locations=AU).

The cell below will visually yield nothing of particular use, just the HTTP status code of our request (you should get [200](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), which indicates success).

In [8]:
response = requests.get("https://api.worldbank.org/v2/country/AU/indicator/SP.POP.TOTL?date=1994:2024")
response

<Response [200]>

Since XML is so similar to HTML, we can return to the BeautifulSoup library from last week, and use this to read in our data (this time using the `xml` parser, rather than `html5lib`).

In [9]:
from bs4 import BeautifulSoup
content = BeautifulSoup(response.text, 'xml')
content

<?xml version="1.0" encoding="utf-8"?>

From there, we can again deconstruct the output in a similar manner. This is more to demonstrate what can be done - compare the code below to the XML output you receive when visiting the requested URL directly in your browser.

In [10]:
content = BeautifulSoup(response.text, 'lxml')  # re-parsing with the 'lxml' parser (an HTML parser, which also copes with these XML tags)
for x in content.find_all('wb:data'):  # iterating through each "wb:data" tag
    if x.find('wb:value').text:  # if a value exists for a given row
        print(x.find('wb:date').text, ': ', x.find('wb:value').text, sep='')  # print

2019: 25303000
2018: 24982688
2017: 24601860
2016: 24190907
2015: 23815995
2014: 23475686
2013: 23128129
2012: 22733465
2011: 22340024
2010: 22031750
2009: 21691700
2008: 21249200
2007: 20827600
2006: 20697900
2005: 20394800
2004: 20127400
2003: 19895400
2002: 19651400
2001: 19413000
2000: 19153000
1999: 18926000
1998: 18711000
1997: 18517000
1996: 18311000
1995: 18072000
1994: 17855000
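
For comparison, the same extraction can be sketched with Python's built-in `xml.etree.ElementTree`, which needs the namespace behind the `wb:` prefix spelled out. The snippet below is hand-written: the 2019 and 2018 values are copied from the output above, the empty 2020 row is invented to show the "no value" case, and the namespace URI is an assumption that should be checked against the real response.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of the World Bank response (not the real payload).
SAMPLE = """<wb:data xmlns:wb="http://www.worldbank.org">
  <wb:data><wb:date>2020</wb:date><wb:value /></wb:data>
  <wb:data><wb:date>2019</wb:date><wb:value>25303000</wb:value></wb:data>
  <wb:data><wb:date>2018</wb:date><wb:value>24982688</wb:value></wb:data>
</wb:data>"""

NS = {"wb": "http://www.worldbank.org"}  # assumed namespace URI for the wb: prefix

root = ET.fromstring(SAMPLE)
population = {}
for row in root.findall("wb:data", NS):          # each direct child row
    value = row.findtext("wb:value", namespaces=NS)
    if value:                                    # skip rows with no value
        year = int(row.findtext("wb:date", namespaces=NS))
        population[year] = int(value)

print(population)
```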


Ultimately, the above examples are simplistic, in that the URLs required to access data are self-contained (no further iteration necessary), and the information that is provided is singular (a simple dictionary or image). Often the data we seek to extract is much more complicated.

We'll explore some further expansions in Sections 3-4, but first, with the group assignment pending, we'll segue into the extraction of geospatial data via APIs:

## 2. Spatial Data APIs

APIs can return spatial data, which we'll cover in further depth next week (Week 8). The group assignment involves usage of the [NSW Points of Interest API](https://datasets.seed.nsw.gov.au/dataset/nsw-points-of-interest-poi), which we'll provide a demonstration of below.

If you open the "SEED Map" linked from the page above, you'll be able to scroll around and see many points of interest. Let's home in on Fisher Library, for example. The cell below demonstrates how query string parameters can be used (noting of course that documentation must be consulted to know what fields and values are acceptable!):

In [11]:
params = {
    'where': "poiname='FISHER LIBRARY SYDNEY UNIVERSITY'",
    'outFields': '*',
    'returnGeometry': 'true',
    'f': 'json'
}
response = requests.get('https://maps.six.nsw.gov.au/arcgis/rest/services/public/NSW_POI/MapServer/0/query', params=params)
response

<Response [200]>

Hopefully the above cell returns an HTTP status code of 200, to indicate success. Unpacking the actual data itself:

In [12]:
places = json.loads(response.text)
print(json.dumps(places['features'][0], indent=2))

{
  "attributes": {
    "objectid": 3083,
    "topoid": 500323885,
    "poigroup": 1,
    "poitype": "Library",
    "poiname": "FISHER LIBRARY SYDNEY UNIVERSITY",
    "poilabel": "FISHER LIBRARY SYDNEY UNIVERSITY",
    "poilabeltype": "NAMED",
    "poialtlabel": null,
    "poisourcefeatureoid": 14,
    "accesscontrol": 1,
    "startdate": 1344594503000,
    "enddate": 32503680000000,
    "lastupdate": 1344594513266,
    "msoid": 143190,
    "centroidid": null,
    "shapeuuid": "71adb868-ad78-3335-adbb-222ed14d68a4",
    "changetype": "M",
    "processstate": null,
    "urbanity": "U"
  },
  "geometry": {
    "x": 151.19049355206874,
    "y": -33.886316917424516
  }
}
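
The `startdate`, `enddate` and `lastupdate` values look odd at first glance. Assuming they follow the usual ArcGIS convention of milliseconds since the Unix epoch (1 January 1970, UTC), a minimal sketch of converting them is below; `from_epoch_ms` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_epoch_ms(ms):
    """Convert milliseconds since the Unix epoch into a timezone-aware datetime (UTC)."""
    return EPOCH + timedelta(milliseconds=ms)

start = from_epoch_ms(1344594503000)   # "startdate" from the output above
end = from_epoch_ms(32503680000000)    # "enddate" from the output above

print(start.isoformat())
print(end.isoformat())
```

Under that assumption, the start date falls in August 2012, while the end date of 1 January 3000 reads as a placeholder for "no end date".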


The above example is helpful if we want to return information about a specific POI we already know by name. To search an area instead, we'll swap our `where` parameter for the API's `geometry` field, which allows us to define a geographical region of interest (by defining the corners via `xmin`, `xmax`, `ymin` and `ymax`).

Consider the below helper function, which allows users to define a midpoint coordinate, then returns nearby POI values. The `boxsize` argument can be adjusted for wider or finer margins, and is defined in kilometres (e.g. `boxsize=5` as the default will yield a 5km by 5km region of interest around the midpoint):

In [13]:
def nearbyPOI(coordinates, boxsize=5, filters=None):
    baseURL = 'https://maps.six.nsw.gov.au/arcgis/rest/services/public/NSW_POI/MapServer/0/query'
    lat, lon = round(coordinates[0], 5), round(coordinates[1], 5)
    # Australia's DCCEEW approximates 1 km as roughly 1/100th of a degree of latitude, hence division by 100.
    # The extra division by 2 splits the full width in each direction (e.g. 2.5km either side of the midpoint).
    delta = boxsize/(100*2)
    params = {
        'geometry': json.dumps({'xmin': lon-delta, 'ymin': lat-delta, 'xmax': lon+delta, 'ymax': lat+delta}),  # bounding box as valid JSON
        'outFields': '*',
        'returnGeometry': 'true',
        'f': 'json'
    }
    params.update(filters or {})  # optional extra query parameters, e.g. a 'where' clause
    response = requests.get(baseURL, params)
    return json.loads(response.text)['features']
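
The "1/100th of a degree per km" rule is an approximation, and it is worth seeing how rough. The sketch below assumes one degree of latitude is about 111 km and that a degree of longitude shrinks by the cosine of the latitude. Both are standard rules of thumb rather than anything specific to this API, and `box_dimensions_km` is a hypothetical helper.

```python
import math

KM_PER_DEGREE = 111.0  # rough length of one degree of latitude, in km (rule of thumb)

def box_dimensions_km(lat, boxsize, degrees_per_km=1 / 100):
    """Approximate real-world (north-south, east-west) size of the box, in km."""
    delta = boxsize * degrees_per_km / 2          # same delta as in the function above
    north_south = 2 * delta * KM_PER_DEGREE
    # Lines of longitude converge towards the poles, so east-west distance shrinks.
    east_west = north_south * math.cos(math.radians(lat))
    return north_south, east_west

ns_km, ew_km = box_dimensions_km(-33.9, boxsize=1)
print(round(ns_km, 2), round(ew_km, 2))
```

Near Sydney, a nominal 1km box comes out at roughly 1.1km north-south by 0.9km east-west, which is fine for a quick "what's nearby" search but not for precise distance work.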

According to Google, the GPS coordinates of the USYD Quadrangle are as defined below. Let's test out our function:

In [14]:
coordinates = (-33.88566318983135, 151.18885144061616) 
results = nearbyPOI(coordinates, boxsize=1)
results

[{'attributes': {'objectid': 2631,
   'topoid': 500301659,
   'poigroup': 3,
   'poitype': 'Park',
   'poiname': 'RESIDENTS PARK',
   'poilabel': 'RESIDENTS PARK',
   'poilabeltype': 'NAMED',
   'poialtlabel': None,
   'poisourcefeatureoid': 61,
   'accesscontrol': 1,
   'startdate': 1285588392000,
   'enddate': 32503680000000,
   'lastupdate': 1285588392535,
   'msoid': 90267,
   'centroidid': None,
   'shapeuuid': '5cb05e3e-2a29-3861-9afa-bb18de788110',
   'changetype': 'I',
   'processstate': None,
   'urbanity': 'U'},
  'geometry': {'x': 151.1881886978318, 'y': -33.88365602309533}},
 {'attributes': {'objectid': 3081,
   'topoid': 500323830,
   'poigroup': 3,
   'poitype': 'Sports Field',
   'poiname': 'ST PAULS OVAL',
   'poilabel': 'ST PAULS OVAL',
   'poilabeltype': 'NAMED',
   'poialtlabel': None,
   'poisourcefeatureoid': 67,
   'accesscontrol': 1,
   'startdate': 1285588392000,
   'enddate': 32503680000000,
   'lastupdate': 1285588392535,
   'msoid': 99056,
   'centroidid': No

There's a lot there! Summarising further to sanity check:

In [15]:
[x['attributes']['poiname'] for x in results if x['attributes']['poiname']]

['RESIDENTS PARK',
 'ST PAULS OVAL',
 'THE UNIVERSITY OF SYDNEY CAMPERDOWN CAMPUS',
 'FISHER LIBRARY SYDNEY UNIVERSITY',
 'LAKE NORTHAM',
 'GLEBE FIRE STATION',
 'UNIVERSITY OVAL NUMBER TWO',
 'UNIVERSITY OVAL NUMBER ONE',
 'ST ANDREWS OVAL',
 'VICTORIA PARK',
 'SYDNEY UNIVERSITY POST OFFICE',
 'THE UNIVERSITY OF SYDNEY',
 'GLEBE TOWN HALL',
 'ST JOHNS CHURCH HALL',
 'AUSTRALIAN PERFORMING ARTS GRAMMAR SCHOOL',
 "ST JOHN'S VILLAGE",
 'GLEBE PUBLIC SCHOOL',
 'PETER FORSYTH AUDITORIUM',
 'GLEBE NEIGHBOURHOOD SERVICE CENTRE',
 'ROBYN KEMMIS RESERVE',
 'VICTORIA PARK POOL']
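
Each feature nests its details under `attributes` and `geometry`, so a flat row per POI is often easier to work with. A minimal sketch is below, using two features trimmed down from the outputs above (only `poiname` and `poitype` are kept from the attributes):

```python
from collections import Counter

# Two features from the outputs above, trimmed to a couple of attributes each.
features = [
    {"attributes": {"poiname": "RESIDENTS PARK", "poitype": "Park"},
     "geometry": {"x": 151.1881886978318, "y": -33.88365602309533}},
    {"attributes": {"poiname": "FISHER LIBRARY SYDNEY UNIVERSITY", "poitype": "Library"},
     "geometry": {"x": 151.19049355206874, "y": -33.886316917424516}},
]

# Merge the two nested dictionaries into one flat dictionary per POI.
rows = [{**f["attributes"], **f["geometry"]} for f in features]

# Count how many POIs fall under each type.
type_counts = Counter(row["poitype"] for row in rows)

print(rows[0])
print(type_counts)
```

A list of flat dictionaries like `rows` can be handed straight to `pd.DataFrame(rows)` for a tabular view.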

#### A parting note on application for the group assignment...

The above function we built returns anything nearby a given **point** (latitude/longitude coordinates of a midpoint), *using* a bounding box.

This will need to be extended for the group assignment, as the requirements there are to iterate through each SA2 area within your selected regions, to find its bounding box, and to feed **this bounding box** into a similar function (i.e. not midpoint based).
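
One possible starting point is sketched below: a helper that takes the four corners of a bounding box directly and builds the query parameters, leaving the request itself to you. The function name and the example coordinates are hypothetical. If your SA2 boundaries are loaded as shapely geometries, their `.bounds` attribute gives the corners in this same `(minx, miny, maxx, maxy)` order.

```python
import json

def poi_params_for_bbox(xmin, ymin, xmax, ymax):
    """Build query parameters for a POI search over an explicit bounding box."""
    envelope = {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax}
    return {
        "geometry": json.dumps(envelope),  # the box corners, as a JSON string
        "outFields": "*",
        "returnGeometry": "true",
        "f": "json",
    }

# Hypothetical bounding box (longitude = x, latitude = y).
params = poi_params_for_bbox(151.18, -33.89, 151.20, -33.88)
print(params["geometry"])
```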

Next week we'll also look further into how the geographic output (i.e. the x/y coordinates of each returned POI) is best processed and stored, both in Pandas (GeoPandas) and SQL (PostGIS). More on that then - but the example above should suffice for demonstration purposes.

## 3. Extracting Data Using Web APIs

Most of the following examples will focus on [Project Gutenberg](https://en.wikipedia.org/wiki/Project_Gutenberg), an altruistic undertaking that has focused on digitising and providing culturally important texts for over 50 years. As part of this motivation to ensure open access to all, there is a "[Gutendex](https://gutendex.com/)" which allows API access to metadata of all books in the collection. Check out its documentation (linked above)!

### 3.1 Nested JSON Objects

As described in the documentation, we can load in book data from across their library by combining the URL they provide with our requests/JSON Python approach from before. This returns a JSON object with a few attributes:

In [16]:
books = json.loads(requests.get('https://gutendex.com/books').text)
books.keys()

dict_keys(['count', 'next', 'previous', 'results'])

If we briefly investigate the data types of each of the four objects, we can see `count` is a number, `next` is a simple string, `previous` is empty, and `results` will likely contain the bulk of what we're interested in, given it is a list.

In [17]:
for k in books.keys():
    print(k, type(books[k]))

count <class 'int'>
next <class 'str'>
previous <class 'NoneType'>
results <class 'list'>


Equipped with this knowledge, we can investigate the values of each. The `count` reveals the current size of the Gutenberg digital library - **over 75,000 texts**! Yet despite so many texts, it appears **only 32 have been returned** in our `results` object. This is intentional, to avoid our API call unintentionally extracting everything at once, which could be a massive download of information (even when just metadata!). The way around this is pagination, which relies on the `next` field and which we'll discuss later.

In [18]:
print('Count:', books['count'])
print('Results:', len(books['results']))
print('Next:', books['next'])

Count: 75623
Results: 32
Next: https://gutendex.com/books/?page=2
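
To get a feel for what retrieving everything would involve, the number of pages follows directly from the count and the page size. A small sketch, assuming the 32-per-page limit seen above holds for every page (`pages_needed` is a hypothetical helper):

```python
import math

PAGE_SIZE = 32  # results per page, as observed above

def pages_needed(count, page_size=PAGE_SIZE):
    """Number of requests required to retrieve `count` results, one page at a time."""
    return math.ceil(count / page_size)

total_pages = pages_needed(75623)
print(total_pages)
```

That is well over two thousand separate requests for the full library, which is why the later examples restrict the query to a much smaller subset.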


Let's investigate our `results` object by considering just an arbitrary item - Shakespeare's famous play "Romeo and Juliet" is the fifth most downloaded (index #4), for example.

This is a good example of a proper **JSON object**, which is effectively just a combination of nested lists and dictionaries. For example, while `id` may be a simple field with a single number, `authors` is a list, in case a text was written by multiple people, and each value within the list is a dictionary, to allow distinction between author fields (their name, birth year and death year).

In [19]:
books['results'][4]

{'id': 1513,
 'title': 'Romeo and Juliet',
 'authors': [{'name': 'Shakespeare, William',
   'birth_year': 1564,
   'death_year': 1616}],
 'summaries': ['"Romeo and Juliet" by William Shakespeare is a tragedy likely written during the late 16th century. The play centers on the intense love affair between two young lovers, Romeo Montague and Juliet Capulet, whose families are embroiled in a bitter feud. Their love, while passionate and profound, is met with adversities that ultimately lead to tragic consequences.  At the start of the play, a Prologue delivered by the Chorus sets the stage for the tale of forbidden love, revealing the familial conflict that surrounds Romeo and Juliet. The opening scenes depict a public brawl ignited by the feud between the Montagues and Capulets, showcasing the hostility that envelops their lives. As we are introduced to various characters such as Benvolio, Tybalt, and Mercutio, we learn of Romeo\'s unrequited love for Rosaline. However, this quickly chan
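
Navigating such an object is just a matter of chaining dictionary keys and list indexes. The sketch below uses a cut-down copy of the record above (only `id`, `title` and `authors` are kept), and `describe` is a hypothetical helper:

```python
# Cut-down copy of the record above.
book = {
    "id": 1513,
    "title": "Romeo and Juliet",
    "authors": [{"name": "Shakespeare, William", "birth_year": 1564, "death_year": 1616}],
}

def describe(book):
    """Build a one-line description, coping with several authors and missing years."""
    credits = [
        f"{a['name']} ({a.get('birth_year') or '?'}-{a.get('death_year') or '?'})"
        for a in book["authors"]          # one dictionary per author
    ]
    return f"{book['title']} by {'; '.join(credits)}"

first_author = book["authors"][0]["name"]  # dict key -> list index -> dict key
print(describe(book))
```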

### 3.2 Requesting Links within Objects

In our Romeo and Juliet example above, notice that within the final key `formats`, there exists an `image/jpeg` key which contains a URL. If we navigate to that field and request that link from the web, we can render the book's cover image using our image display skills from earlier.

In [20]:
display_image(requests.get(books['results'][4]['formats']['image/jpeg']))

Image(value=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\x03\x02\x0…

To be super comprehensive, we could even notice a `text/html` link is also provided. Using our **web scraping** skills from last week's tutorial, we could additionally fetch this URL, and parse its webpage contents.

In [21]:
webpage_source = requests.get(books['results'][4]['formats']['text/html']).text
content = BeautifulSoup(webpage_source, 'html5lib')

From there, for example, we could print all `h2` and `h3` fields within it, which provides a nice simple overview of what to expect within the classic play (code below also adds in a line-break after the major headings, for ease of reading):

In [22]:
for header in content.find_all(['h2', 'h3']):
    if header.name == 'h2':
        print('_____\n')
    print(header.text)

_____

The Project Gutenberg eBook of Romeo and Juliet
_____

by William Shakespeare
Contents
 Dramatis Personæ 
 SCENE. During the greater part of the Play in Verona; once, in
the Fifth Act, at Mantua.
 THE PROLOGUE
_____

 ACT I
 SCENE I. A public place.
 SCENE II. A Street.
 SCENE III. Room in Capulet’s House.
 SCENE IV. A Street.
 SCENE V. A Hall in Capulet’s House.
_____

 ACT II
 SCENE I. An open place adjoining Capulet’s Garden.
 SCENE II. Capulet’s Garden.
 SCENE III. Friar Lawrence’s Cell.
 SCENE IV. A Street.
 SCENE V. Capulet’s Garden.
 SCENE VI. Friar Lawrence’s Cell.
_____

 ACT III
 SCENE I. A public Place.
 SCENE II. A Room in Capulet’s House.
 SCENE III. Friar Lawrence’s cell.
 SCENE IV. A Room in Capulet’s House.
 SCENE V. An open Gallery to Juliet’s Chamber, overlooking the
Garden.
_____

 ACT IV
 SCENE I. Friar Lawrence’s Cell.
 SCENE II. Hall in Capulet’s House.
 SCENE III. Juliet’s Chamber.
 SCENE IV. Hall in Capulet’s House.
 SCENE V. Juliet’s Chamber; Juliet on t

But we won't delve too deeply into that yet - a whole week on text data awaits after the break :)

### 3.3 Query Parameters and Pagination

Commonly within APIs, further **query parameters** can be provided to the website when requesting its content. These are usually written as a `?` at the end of the URL, followed by a list of `key=value` pairs (separated by `&`) for all conditions required.

_(with the `requests` library in Python, we can also pass them as parameters via a dictionary, so this has also been included as a code comment)_

For example, in the [Gutendex docs](https://gutendex.com/), a "topic" of the user's choice can be selected to filter the data returned. A 20th-century American author, Anne Haight, compiled a collection of "[banned books](https://www.gutenberg.org/ebooks/bookshelf/336)" from different points in time around the world, which we can access by restricting our API call to only those with a "topic" of 'banned'. This will return a much smaller sample of fewer than 200 texts.

In [23]:
bannedbooks = json.loads(requests.get('https://gutendex.com/books?topic=banned').text)
#bannedbooks = json.loads(requests.get('https://gutendex.com/books', params={'topic': 'banned'}).text)
print('Count:', bannedbooks['count'])
print('Results:', len(bannedbooks['results']))
print('Next:', bannedbooks['next'])

Count: 181
Results: 32
Next: https://gutendex.com/books/?page=2&topic=banned
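
The relationship between a dictionary of parameters and the query string can be seen with Python's built-in `urllib.parse` module, with no web request required. A minimal sketch:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Dictionary -> query string
base = "https://gutendex.com/books"
url = f"{base}?{urlencode({'topic': 'banned', 'page': 2})}"
print(url)

# Query string -> dictionary (each value is a list, since keys may repeat)
next_link = "https://gutendex.com/books/?page=2&topic=banned"
query_params = parse_qs(urlsplit(next_link).query)
print(query_params)
```

This is essentially the translation the `requests` library performs when a dictionary is passed via `params`.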


It's now worth noting the point of the `next` field. Since Gutendex API calls are limited to 32 results at a time, each call will indicate how to retrieve the *next* 32 rows on the next "page". For our banned books, this is a very similar URL, just now also with a `page=2` query parameter.

By providing this, we don't have to try and reverse-engineer the URLs to return a larger dataset, we can simply iterate through, each time noting the `next` link, until one no longer exists. The query below does just that, with attached code comments so you can follow along:

In [24]:
import time as t  # using the inbuilt "time" module for explicit wait times in our code

keepgoing = True  # establishing a simple variable indicating if the looping should continue
URL = 'https://gutendex.com/books?topic=banned'  # our base URL to begin with
results = []  # an empty list ready to store our results as we go

print('Data loading begins...')  # message to the user
while keepgoing:  # a WHILE loop that depends on the "keepgoing" variable being True
    t.sleep(2)  # purposefully waiting two seconds before we retrieve the URL's contents
    print(URL)  # printing the link we're using in the API call
    page = json.loads(requests.get(URL).text)  # retrieving the content as a JSON object
    results += page['results']  # adding the 'results' section to our stored list
    if page['next']:  # if there is a 'next' field of results noted
        URL = page['next']  # then make this our new URL for when the loop repeats
    else:  # if there is no longer a 'next' field
        keepgoing = False  # then we have reached the end and can terminate the loop
        print('Data load complete.')  # message to the user

Data loading begins...
https://gutendex.com/books?topic=banned
https://gutendex.com/books/?page=2&topic=banned
https://gutendex.com/books/?page=3&topic=banned
https://gutendex.com/books/?page=4&topic=banned
https://gutendex.com/books/?page=5&topic=banned
https://gutendex.com/books/?page=6&topic=banned
Data load complete.
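
The same "follow `next` until it runs out" pattern can be wrapped in a reusable function. In the sketch below the page-fetching step is passed in as an argument, so the logic can be tried offline. `fetch_all` and the fake pages are hypothetical, invented purely for illustration.

```python
def fetch_all(first_url, fetch_page):
    """Collect 'results' from every page, following each page's 'next' link until it is empty."""
    results, url = [], first_url
    while url:                            # stops once 'next' is None
        page = fetch_page(url)            # must return a dict with 'results' and 'next'
        results.extend(page["results"])
        url = page["next"]
    return results

# Invented in-memory "API" with two pages, standing in for real web requests.
FAKE_PAGES = {
    "page-1": {"results": ["a", "b"], "next": "page-2"},
    "page-2": {"results": ["c"], "next": None},
}

all_results = fetch_all("page-1", FAKE_PAGES.__getitem__)
print(all_results)
```

Against a live API, `fetch_page` could be something like `lambda u: requests.get(u).json()`, ideally with a short pause between calls as in the loop above.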


Despite being a semi-structured datasource, we can still process our results as a Pandas dataframe, if only to test it has worked correctly. It will be a little messy, given some fields are lists or lists of dictionaries, but we can see the 181 texts below:

In [None]:
import pandas as pd
bannedbooks = pd.DataFrame(results)
bannedbooks

## 4. Transforming Semi-Structured Datasets

If we are to answer meaningful questions on our dataset, our best case scenario would be achieving a structured representation of the data model, which we can query.

### 4.1 Spinning Off Entities

For some columns, this is easy. The following columns are simple values from our main dataframe without further nested depth, so we'll store these as our `booksdf` object.

In [None]:
booksdf = bannedbooks[['id', 'title', 'copyright', 'media_type', 'download_count']]
booksdf

Other fields, however, are more complex, such as "subjects". As a simple list, this would be best spun out into one row for each subject value. We can achieve this using Pandas' dramatically-named `explode()` function, and store this as another dataframe for now - `subjectsdf`.

In [None]:
subjectsdf = bannedbooks[['id', 'subjects']]
subjectsdf = subjectsdf.explode('subjects')
subjectsdf
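
If `explode()` feels like magic, the plain-Python sketch below mimics the idea on two invented records: every element of the list becomes its own row, with the other fields repeated. The records and the `explode_rows` helper are hypothetical. Note that for an empty list, Pandas keeps the row with a missing value (`NaN`), which is imitated here with `None`.

```python
# Two invented records, for illustration only.
records = [
    {"id": 1, "subjects": ["Subject A", "Subject B"]},
    {"id": 2, "subjects": []},
]

def explode_rows(records, column):
    """Return one row per list element in `column`, repeating the other fields."""
    rows = []
    for record in records:
        values = record[column] or [None]   # keep a row even when the list is empty
        for value in values:
            rows.append({**record, column: value})
    return rows

exploded = explode_rows(records, "subjects")
for row in exploded:
    print(row)
```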

### 4.2 Complex Entity Transformations

Fields such as "authors" are even more complex again. Recall each value here is a list of dictionaries, with one dictionary per author (potentially more than one), so we can begin by spinning this off into one row per author of each book, like the "subjects" approach above. This increases our row count from 181 to 187, so there are a handful of books with multiple attributed authors. We'll store this as `authorsdf`, though each row still contains a dictionary of information for each author.

In [None]:
authorsdf = bannedbooks[['id', 'authors']]
authorsdf = authorsdf[['id', 'authors']].explode('authors').reset_index(drop=True)
authorsdf

It's worth pointing out that the Pandas transformations shown below are **purely for our benefit, and not examinable functions you need to remember**. Don't stress about the next couple of code blocks - they are not super important but explained for transparency.

The "authors" column really should be spun out into three separate columns, one for each attribute (birth_year, death_year, name). Pandas considers the data type of this column to be a Series, which is necessary knowledge to enable the split.

In [None]:
type(authorsdf.authors)

Using the `apply` function, we can split it out into one column for each attribute, and then join it back in place of the original column, to produce a much more friendly output - our final version of `authorsdf`.

In [None]:
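authorfields = authorsdf['authors'].apply(pd.Series).drop(0, axis=1, errors='ignore')  # a stray column "0" appears only if a book has no authors, so ignore it when absent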
authorsdf = authorsdf.join(authorfields).drop('authors', axis=1).reset_index(drop=True)
authorsdf

One final entity we'll spin off - the "formats". The code below is condensed since the details aren't important, but to summarise, it acts similarly to "authors", except that each value is a single dictionary rather than a list of dictionaries (hence no `.explode()` function). It does, however, require one extra transformation using `.melt()` so that we don't have one column per format, but rather a simple key-value table.

Again - not super important, just for those interested in the magic behind the scenes! What's important is that now we have **four, much more friendly, structured tables** for our dataset.

In [None]:
formatsdf = bannedbooks[['id', 'formats']]
formatfields = formatsdf['formats'].apply(pd.Series)
formatsdf = formatsdf.join(formatfields).drop('formats', axis=1).reset_index(drop=True)
formatsdf = formatsdf.melt(id_vars=['id']).dropna(subset=['value'])
formatsdf.rename(columns={'variable': 'format', 'value': 'link'}, inplace=True)
formatsdf
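
For those curious what the `.melt()` and `.dropna()` steps achieve, here is a plain-Python sketch of the same wide-to-long reshaping on two invented rows. The links and the `to_long` helper are hypothetical, and the row order differs from what Pandas would produce.

```python
# Two invented "wide" rows: one column per format, None where a format is unavailable.
wide = [
    {"id": 1, "text/html": "https://example.org/1.html", "image/jpeg": "https://example.org/1.jpg"},
    {"id": 2, "text/html": "https://example.org/2.html", "image/jpeg": None},
]

def to_long(rows, id_key="id"):
    """Turn one-column-per-format rows into (id, format, link) rows, skipping missing links."""
    return [
        {id_key: row[id_key], "format": key, "link": value}
        for row in rows
        for key, value in row.items()
        if key != id_key and value is not None
    ]

long_rows = to_long(wide)
for row in long_rows:
    print(row)
```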

Summarising the dimensions of our four new cleaned tables for a sense of satisfaction:

In [None]:
print('Books:', booksdf.shape)
print('Subjects:', subjectsdf.shape)
print('Authors:', authorsdf.shape)
print('Formats:', formatsdf.shape)

### 4.3 Data Import

The functions below should be quite familiar by now - since we've done the hard yards, let's import our tables into our localhost database and run a few queries to see if we can manage some interesting findings.

Recall also this depends on your individual `Credentials.json` from previous weeks existing in your directory again!

In [None]:
from sqlalchemy import create_engine, text
import psycopg2
import psycopg2.extras
import json
import os
import pandas as pd

credentials = "Credentials.json"

def pgconnect(credential_filepath, db_schema="public"):
    with open(credential_filepath) as f:
        db_conn_dict = json.load(f)
        host       = db_conn_dict['host']
        db_user    = db_conn_dict['user']
        db_pw      = db_conn_dict['password']
        default_db = db_conn_dict['user']
        try:
            db = create_engine('postgresql+psycopg2://'+db_user+':'+db_pw+'@'+host+'/'+default_db, echo=False)
            conn = db.connect()
            print('Connected successfully.')
        except Exception as e:
            print("Unable to connect to the database.")
            print(e)
            db, conn = None, None
        return db,conn

def query(conn, sqlcmd, args=None, df=True):
    result = pd.DataFrame() if df else None
    try:
        if df:
            result = pd.read_sql_query(sqlcmd, conn, params=args)
        else:
            result = conn.execute(text(sqlcmd), args).fetchall()
            result = result[0] if len(result) == 1 else result
    except Exception as e:
        print("Error encountered: ", e, sep='\n')
    return result
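
One caveat with building the connection string by concatenation: a password containing characters such as `@` or `/` will confuse the URL. A hedged sketch of a workaround is below, escaping the credentials with the standard library's `quote_plus` first. The `build_pg_url` helper and the example credentials are hypothetical.

```python
from urllib.parse import quote_plus

def build_pg_url(user, password, host, database):
    """Assemble a PostgreSQL connection URL, escaping special characters in the credentials."""
    return f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"

# Hypothetical credentials, chosen to include awkward characters.
url = build_pg_url("student", "p@ss/word", "localhost", "student")
print(url)
```

The resulting string could be passed to `create_engine()` in place of the concatenated one.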

In [None]:
db, conn = pgconnect(credentials)

Let's create a new schema for our data, simply entitled "Books".

In [None]:
conn.execute(text("create schema if not exists Books"))
conn.execute(text("set search_path to Books"))

From there, we'll create tables for our three spun off datasets, and populate them from Pandas as below. This process should be quite familiar to you by now - we'll check it worked by selecting all from "Authors" at the end.

In [None]:
conn.execute(text("""
DROP TABLE IF EXISTS Subjects;
DROP TABLE IF EXISTS Authors;
DROP TABLE IF EXISTS Formats;

CREATE TABLE Subjects(
   id int,
   subjects varchar(1000)
);
CREATE TABLE Authors(
   id int,
   birth_year int,
   death_year int,
   name varchar(100)
);
CREATE TABLE Formats(
   id int,
   format varchar(100),
   link text
);
"""))
subjectsdf.to_sql("subjects", con=conn, if_exists='append', index=False)
authorsdf.to_sql("authors", con=conn, if_exists='append', index=False)
formatsdf.to_sql("formats", con=conn, if_exists='append', index=False)
query(conn, "select * from Authors")

**Task: Create a table for Books and populate it.**

Our `booksdf` from above is the only one of our four tables not yet populated in our localhost database. Similarly to the examples above, define a "Books" table using a `CREATE TABLE` command, then populate it with our Pandas dataframe.

In [None]:
### TO DO
conn.execute(text("""
DROP TABLE IF EXISTS Books;
CREATE TABLE Books(
   id int primary key,
   title varchar(500),
   copyright boolean,
   media_type varchar(50),
   download_count int
);"""))
booksdf.to_sql("books", con=conn, if_exists='append', index=False)
query(conn, "select * from Books")

### 4.4 Data Querying

**Task: Attempt the four SQL queries below.**

**a) Find all books attributed to authors *without a birth year*, sorted in alphabetical order.**

In [None]:
### TO DO
sql = """
select b.title, b.download_count, a.*
from Books b
join Authors a using (id)
where birth_year is null
order by title
"""
query(conn, sql)

**b) Find the top 5 authors with the greatest number of banned books.**

In [None]:
### TO DO
sql = """
select a.name, count(*)
from Books b
join Authors a using (id)
group by a.name
order by count(*) desc
limit 5
"""
query(conn, sql)

**c) Produce a list of all 'audio' formats available for texts in our dataset.**

In [None]:
### TO DO
sql = """
select b.title, b.download_count, f.format, f.link
from Books b
join Formats f using (id)
where format like 'audio/%%'
"""
query(conn, sql)

**d) Which subject containing at least 5 banned books has the highest average download count?**

In [None]:
### TO DO
sql = """
select s.subjects, avg(b.download_count)
from Books b
join Subjects s using (id)
group by s.subjects
having count(*) >= 5
order by avg(b.download_count) desc
"""
query(conn, sql)
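
If you'd like to experiment with the `GROUP BY` / `HAVING` pattern without a PostgreSQL connection, the sketch below runs the same shape of query against an in-memory SQLite database. SQLite is only a stand-in here, the rows are invented toy data, and the threshold is lowered to 2 so the tiny sample has something to return.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")  # separate name, so the notebook's `conn` is untouched
demo_conn.executescript("""
CREATE TABLE Books(id int primary key, title text, download_count int);
CREATE TABLE Subjects(id int, subjects text);
INSERT INTO Books VALUES (1, 'A', 100), (2, 'B', 300), (3, 'C', 50);
INSERT INTO Subjects VALUES (1, 'Fiction'), (2, 'Fiction'), (3, 'Satire'), (1, 'Satire'), (2, 'Poetry');
""")

demo_sql = """
select s.subjects, avg(b.download_count) as avg_downloads
from Books b
join Subjects s using (id)
group by s.subjects
having count(*) >= ?
order by avg_downloads desc
"""
demo_rows = demo_conn.execute(demo_sql, (2,)).fetchall()  # keep subjects with at least 2 books
print(demo_rows)
```

'Poetry' has only one book in the toy data, so the `HAVING` clause filters it out before the averages are ranked.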

#### Challenge: Feel free to investigate other account-based APIs!

That concludes this week's content - but do consider the next step of APIs, which is those that require authentication to access data. Most platforms, big and small, are structured this way (locally, examples like the NSW Government, and globally, tech giants like YouTube and OpenAI). Feel free to explore further with examples like these!
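
As a starting point, many (though not all) authenticated APIs expect a secret key or token to be sent in a request header. The sketch below shows one common pattern, a "Bearer" token read from an environment variable so that it never appears in the notebook itself. The header scheme, the variable name `EXAMPLE_API_KEY` and the `auth_headers` helper are all assumptions; each API's documentation will specify what it actually requires.

```python
import os

def auth_headers(env_var="EXAMPLE_API_KEY"):
    """Build request headers carrying a token read from an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"Authorization": f"Bearer {token}"}

os.environ["EXAMPLE_API_KEY"] = "not-a-real-key"  # for demonstration only
headers = auth_headers()
print(headers)
```

The resulting dictionary could then be passed along with a request, e.g. `requests.get(url, headers=headers)`.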