### Extracting Data from Trove APIs

#### Importing Libraries

At the top of the notebook, we import the libraries we are going to use. This example uses three:
1. [Pandas](https://pandas.pydata.org/) (https://pandas.pydata.org/)

   Pandas is a data analysis library. Even if you are not using it for data analysis, it has many built-in tools and utilities that are highly useful: it can read JSON or CSV data and convert it into Python objects, and it can write JSON or CSV files from data structures that you have built or modified. Pandas might need to be installed separately unless you used a distribution such as Anaconda.

   To install, simply type on a command line:
   ```
   pip install pandas
   ```

1. [Requests: HTTP for Humans](http://docs.python-requests.org/en/master/) (http://docs.python-requests.org/en/master/)

   Requests is a library that makes working with the web almost trivial. It eliminates much of the code you would otherwise have to write if you were to make HTTP requests and deal with responses. Requests might need to be installed separately unless you've used an installation package like Anaconda.

   To install, simply type on a command line:
   ```
   pip install requests
   ```

1. [JSON](https://docs.python.org/3/library/json.html) (https://docs.python.org/3/library/json.html)

   Unlike the previous two libraries, JSON is built into Python. The JSON library allows Python to convert JSON objects to Python objects and vice versa.

In [None]:
import json
import pandas as pd
import requests

#### Setting up some constants and global variables

In [None]:
QUERY = "erasmus luther"
BASE_URL = "https://api.trove.nla.gov.au/v2/result?key=[enter TROVE API Key Here]&zone=book&encoding=json&n=100&q=" + QUERY
session = requests.Session()
dataset = []
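
The URL above is built by string concatenation, so the space in the query is left for Requests to encode. As a minimal sketch, the same URL can be assembled with the standard library's `urllib.parse.urlencode`, which percent-encodes every value. The helper name `build_url`, its parameters and the placeholder key are hypothetical; the endpoint and query parameters are copied from `BASE_URL`.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.trove.nla.gov.au/v2/result"


def build_url(query, api_key, next_start=None):
    """Hypothetical helper: assemble a Trove v2 search URL from its parts."""
    params = {
        "key": api_key,
        "zone": "book",
        "encoding": "json",
        "n": 100,
        "q": query,
    }
    # Only add the paging token when there is one (the first page has none).
    if next_start is not None:
        params["s"] = next_start
    # urlencode percent-encodes values, e.g. the space in the query becomes "+".
    return API_ROOT + "?" + urlencode(params)


print(build_url("erasmus luther", "YOUR_API_KEY"))
```

Encoding the values this way also protects a paging token that happens to contain characters such as `+` or `=`.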

#### Defining functions
Here we have a function that fetches a page of data for us from Trove. It returns __nextStart__ and __work__. If __nextStart__ is __None__, that means that we have no more pages to fetch. __work__ is a list of all the entries of our search result that are on the fetched page.

In [None]:
def fetch_page_data(nextStart=None):
    # Append the paging token only when we have one (the first request has none).
    suffix = "&s=" + nextStart if nextStart is not None else ""
    response = session.get(BASE_URL + suffix)
    data = response.json()
    # All results for the single zone we asked for ("book") live under "records".
    records = data["response"]["zone"][0]["records"]
    # "nextStart" is absent on the last page, so .get() returns None there.
    nextStart = records.get("nextStart")

    return nextStart, records["work"]
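
To see what the function unpacks without calling the API, here is a small sketch that applies the same lookups to two made-up payloads. The helper `parse_page` and the sample values are hypothetical; the payloads contain only the keys that `fetch_page_data` reads, not a full Trove response.

```python
def parse_page(data):
    """Hypothetical helper: pull the paging token and works out of one response."""
    records = data["response"]["zone"][0]["records"]
    # The last page has no "nextStart" key, so .get() yields None.
    return records.get("nextStart"), records["work"]


# Made-up payloads containing only the keys that fetch_page_data reads.
middle_page = {"response": {"zone": [{"records": {
    "nextStart": "TOKEN-2",
    "work": [{"id": "1", "title": "Example work"}],
}}]}}
last_page = {"response": {"zone": [{"records": {
    "work": [{"id": "2", "title": "Another example"}],
}}]}}

print(parse_page(middle_page)[0])  # the token for the following page
print(parse_page(last_page)[0])    # None, because there is no further page
```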

#### Fetch the data
Here we fetch the data. We make an initial call to __fetch_page_data__ with no argument, because we are fetching the first page and do not yet have a __nextStart__ token. We store the result of the function in two variables:
1. __nextStart__: this is the token of the next page in our result set.
1. __work__: this holds the list of results returned so far.

We then start the loop. If __nextStart__ is None, that means there is no next page and we are done. If __nextStart__ contains a value, then we need to fetch the data on the next page.

In [None]:
nextStart, work = fetch_page_data()
print("fetched nextStart: {}, size: {}".format(nextStart, len(work)))
while nextStart is not None:
    nextStart, dataset = fetch_page_data(nextStart)
    print("fetched nextStart: {}, size: {}".format(nextStart, len(dataset)))
    work += dataset

print("Number of records retrieved: {}".format(len(work)))
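
The control flow of this loop can be tried offline. The sketch below runs the same "fetch, then follow the token until it is None" pattern against a hypothetical in-memory stand-in for Trove; `fetch_all`, `fake_fetch_page` and `FAKE_PAGES` are illustrative names, not part of the notebook.

```python
def fetch_all(fetch_page):
    """Collect works from every page; fetch_page mimics fetch_page_data."""
    next_start, works = fetch_page()
    while next_start is not None:
        next_start, page = fetch_page(next_start)
        works += page
    return works


# Hypothetical stand-in for Trove: three pages keyed by their paging token.
FAKE_PAGES = {
    None: ("t1", ["work-a", "work-b"]),
    "t1": ("t2", ["work-c", "work-d"]),
    "t2": (None, ["work-e"]),
}


def fake_fetch_page(next_start=None):
    token, works = FAKE_PAGES[next_start]
    return token, list(works)  # copy so repeated runs start fresh


all_works = fetch_all(fake_fetch_page)
print(len(all_works))
```

Passing the fetch function in as an argument is only a convenience for testing; the notebook's loop does the same thing with the global `fetch_page_data`.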

#### Prepare the data
Convert the data we have retrieved into a JSON string, and pass that string to Pandas which can use it to create a DataFrame for us. This is an easy method to create a DataFrame from JSON. We can have a look at the first 5 rows of the DataFrame.

In [None]:
trove_ids = pd.read_json(json.dumps(work)).fillna("")
trove_ids.head()
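
Newer pandas releases (2.1 onwards) warn when a literal JSON string is passed to `read_json`. A minimal sketch of the same step that wraps the string in `io.StringIO` is shown here; the two sample records and the `issued` field are made up for illustration.

```python
import io
import json

import pandas as pd

# Made-up records; the second has a field that the first lacks.
sample_work = [
    {"id": "1", "troveUrl": "https://example.org/work/1", "title": "First example"},
    {"id": "2", "troveUrl": "https://example.org/work/2", "title": "Second example",
     "issued": "1900-1910"},
]

# Serialise to a JSON string, then let pandas read it from a file-like object.
frame = pd.read_json(io.StringIO(json.dumps(sample_work)))
# Records without a given field get NaN in that column; fillna("") blanks them.
frame = frame.fillna("")

print(frame.head())
```

Note that `read_json` tries to infer column types, so numeric-looking strings such as the `id` values may come back as numbers.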

#### Export the data
We can now export the data into a CSV file. Pandas again makes this very easy to do. We call the __to_csv__ method of the DataFrame and provide:
1. The name of the CSV file we want to export to.
1. _index=False_, which tells Pandas not to export the row number of the records.
1. _columns_, which tells Pandas which columns from the DataFrame to export and in what order. We do not have to export all of them. In our case we just want the __Trove ID__, __Trove URL__ and __Title__.

In [None]:
trove_ids.to_csv("trove-data.csv",
                 index=False,
                 columns=['id', 'troveUrl', 'title'])
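
To check the effect of these arguments without writing a file, the sketch below exports a small made-up DataFrame to an in-memory buffer. The sample rows and the `extra` column are hypothetical; `to_csv` accepts a file-like object in place of a file name.

```python
import io

import pandas as pd

# Made-up rows with one column ("extra") that we do not want to export.
frame = pd.DataFrame([
    {"id": "1", "title": "First example", "troveUrl": "https://example.org/work/1", "extra": "x"},
    {"id": "2", "title": "Second example", "troveUrl": "https://example.org/work/2", "extra": "y"},
])

buffer = io.StringIO()
# index=False drops the row numbers; columns selects and orders the output.
frame.to_csv(buffer, index=False, columns=["id", "troveUrl", "title"])

print(buffer.getvalue())
```

The header row follows the order given in `columns`, not the order of the columns in the DataFrame.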