# Sample Python Application
## Extracting Course Handbook Information

### The Handbook
The University Wiki contains lots of interesting information. Within this forest of knowledge, you can find instructions on how to use the University's Course Handbook API.

The URL: [Handbook API](https://wiki.mq.edu.au/display/webstrategy/APIs+and+System+Dependencies) (https://wiki.mq.edu.au/display/webstrategy/APIs+and+System+Dependencies)

### Our Example
For this example we are going to use the Course Handbook API to extract information about Courses. We are going to look at creating a CSV file with the following data:
1. Code
1. Department
1. Faculty
1. Name
1. Offering
1. Description

### Exploring the Code

#### We are going to be breaking down the following Python script:
```python
import pandas as PD
import requests
import json

data_url = "http://api.prod.handbook.mq.edu.au/Units/JSON/2019/9f9ef28dea630ae6311cc730207b2b59"
unit_url = "http://api.prod.handbook.mq.edu.au/Unit/JSON/{}/2019/9f9ef28dea630ae6311cc730207b2b59"

s = requests.Session()
r = s.get(data_url)
data = PD.read_json(r.text)
data.head()

for index, row in data.iterrows():
    url = unit_url.format(row["Id"])
    r = s.get(url)
    
    if r.status_code == 200:
        print("processing code: {}".format(row["Code"]))
        unit_data = json.loads(r.text)
        
        if len(unit_data["UnitOfferings"]) == 0:
            continue
            
        if "code" in unit_data["UnitOfferings"][0]:
            data.at[index, "Offering"] = unit_data["UnitOfferings"][0]["code"]

        if "description" in unit_data["UnitOfferings"][0]:
            data.at[index, "Description"] = unit_data["UnitOfferings"][0]["description"]

data = data.fillna("")
data.head()
data.to_csv("unit-codes-2019.csv",
            index=False,
            columns=['Code', 'Department', 'Faculty', 'Name', 'Offering', 'Description'])
```


#### Importing Libraries

At the top, we usually import libraries that we are going to be using. For this example we are using 3 libraries:
1. [Pandas](https://pandas.pydata.org/) (https://pandas.pydata.org/)

   Pandas is a data analysis library. Even if you are not using it for data analysis, it has many built-in tools and utilities that can be highly beneficial. For example, it can easily read JSON or CSV data and convert it into Python objects, and it can just as easily write JSON or CSV files from data structures that you've built or modified. Pandas might need to be installed separately unless you've used an installation package like Anaconda.

   To install, simply type on a command line:
   ```
   pip install pandas
   ```

1. [Requests: HTTP for Humans](http://docs.python-requests.org/en/master/) (http://docs.python-requests.org/en/master/)

   Requests is a library that makes working with the web almost trivial. It eliminates much of the code you would otherwise have to write if you were to make HTTP requests and deal with responses. Requests might need to be installed separately unless you've used an installation package like Anaconda.

   To install, simply type on a command line:
   ```
   pip install requests
   ```

1. [JSON](https://docs.python.org/3/library/json.html) (https://docs.python.org/3/library/json.html)

   Unlike the previous two libraries, JSON is built into Python. The JSON library allows Python to convert JSON objects to Python objects and vice versa.
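
As a small illustration of that conversion, the sketch below parses a JSON string into a Python dictionary and then turns it back into JSON text. The sample record is a shortened copy of one from the course list shown later in this document; the variable names are only for illustration.

```python
import json

# A JSON document held in a Python string (part of one course record).
raw = '{"Code": "ABEC313", "Name": "Early Development 2", "University": false}'

# json.loads converts JSON text into Python objects (here, a dict).
course = json.loads(raw)
print(course["Code"])        # ABEC313
print(course["University"])  # False (JSON false becomes Python False)

# json.dumps goes the other way: Python objects back to JSON text.
round_trip = json.dumps(course)
print(round_trip)
```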

```python
# Importing the required libraries
import pandas as PD
import requests
import json
```

#### Setting up some constants
In this next section we set up some constants. Constants are pieces of data that never change.

##### Complete Course List: data_url
First we set up the URL from which we are going to get our complete list of courses for 2019. The URL is copied from the Handbook API page, with the year changed to 2019.

The resulting dataset should look like the snippet below:
```javascript
[
    {
        "Code": "ABEC313",
        "Department": "Department of Educational Studies",
        "Faculty": "Faculty of Human Sciences",
        "Id": "40321",
        "ModifiedDate": "2018-05-28 09:33am",
        "Name": "Early Development 2",
        "Status": "Green",
        "University": false,
        "Version": "2"
    },
    {
        "Code": "ABEP330",
        "Department": "Department of Educational Studies",
        "Faculty": "Faculty of Human Sciences",
        "Id": "18",
        "ModifiedDate": "2018-05-28 09:47am",
        "Name": "Program Planning in ATSI Contexts",
        "Status": "Green",
        "University": false,
        "Version": "2"
    }
]
```

##### Course Details: unit_url
The second constant we set up is a URL for obtaining comprehensive data about a specific course. From the snippet above, you can see that each course contains an __Id__. To get the comprehensive data about a course, we use the __unit_url__. Note the __{}__ in the URL: this is a placeholder for the course we want detailed information about, and we need to replace it with that course's __Id__. For the first course in the snippet above, this would look like:
`http://api.prod.handbook.mq.edu.au/Unit/JSON/40321/2019/9f9ef28dea630ae6311cc730207b2b59`

Further down, you will see the following line:

```python
url = unit_url.format(row["Id"])
```

This replaces the placeholder __{}__ with the value of __row["Id"]__ using Python's string formatting. There is a fantastic write-up explaining how this works [here](https://pyformat.info/) (https://pyformat.info/).
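
To see the placeholder substitution on its own, here is a minimal sketch that fills the template with the two __Id__ values from the sample course list above. The names `course_id` and `first_url` are only used for this illustration.

```python
# Template with a single {} placeholder, as in unit_url.
unit_url = "http://api.prod.handbook.mq.edu.au/Unit/JSON/{}/2019/9f9ef28dea630ae6311cc730207b2b59"

# Ids taken from the sample course list above.
for course_id in ["40321", "18"]:
    # format() returns a new string; the template itself is unchanged.
    print(unit_url.format(course_id))

first_url = unit_url.format("40321")
```

Because `format` returns a new string each time, the same template can be reused for every row.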

```python
# declare constants
data_url = "http://api.prod.handbook.mq.edu.au/Units/JSON/2019/9f9ef28dea630ae6311cc730207b2b59"
unit_url = "http://api.prod.handbook.mq.edu.au/Unit/JSON/{}/2019/9f9ef28dea630ae6311cc730207b2b59"
```

#### Fetching the first dataset - a complete list of courses with a summary of information about each

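When using the requests library to make repeated calls to the same server, it is best to use a session. There are a number of reasons for this, but the simplest is that it is more efficient: a session reuses the underlying connection instead of opening a new one for every request. The two options are: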
1. ```python
   r = requests.get(url)
   ```
or
2. ```python
   session = requests.Session()   
   r = session.get(url)
   ```
The extra line in option 2 is well worth it for the speed increase you will get when making repeated calls to a Web API.

The `session.get(url)` call fetches the data at the URL provided. This is roughly the same as typing a URL into a web browser. The data that comes back is placed in the response object, which we've called __r__.
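
If you want the fetch to fail loudly rather than quietly return an error page, one possible variation is sketched below. The helper name `fetch_text` and the 10 second timeout are assumptions made for this example and are not part of the original script; `timeout` and `raise_for_status()` are standard features of Requests.

```python
def fetch_text(session, url, timeout=10):
    """Fetch a URL with the given session and return the response body as text."""
    # The timeout stops the call from waiting forever on an unresponsive server.
    response = session.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of carrying on silently.
    response.raise_for_status()
    return response.text


def main():
    # Imported here so that fetch_text itself has no third-party dependency.
    import requests

    data_url = "http://api.prod.handbook.mq.edu.au/Units/JSON/2019/9f9ef28dea630ae6311cc730207b2b59"
    # The with-block closes the session and its pooled connections when done.
    with requests.Session() as session:
        text = fetch_text(session, data_url)
        print(len(text))
```

Calling `main()` performs the real request, so it needs network access and a responding API endpoint.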

```python
# Create a session for efficiency and call it 's'
s = requests.Session()

# Make a web request and store the response in 'r' after
# which 'r' will have the data that we have fetched.
r = s.get(data_url)

# You can access the data by using 'r.text'. We take the text data,
# and get Pandas to read it as JSON (which it is) and then convert
# it into a Pandas DataFrame (think Excel Worksheet)
data = PD.read_json(r.text)
```

Let's look at what we now have in the data frame '__data__':

```python
data.head()
```
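
One note on reading the JSON: recent versions of Pandas (2.1 onwards) deprecate passing a literal JSON string to `read_json` and ask for a file-like object instead. A minimal sketch of that variation is shown below. It uses a small hard-coded sample in place of `r.text`, with records shortened from the snippet earlier; `sample_text` is a name made up for this example.

```python
import io
import pandas as PD

# Stand-in for r.text: two shortened records from the sample course list.
sample_text = """[
    {"Code": "ABEC313", "Id": "40321", "Name": "Early Development 2"},
    {"Code": "ABEP330", "Id": "18", "Name": "Program Planning in ATSI Contexts"}
]"""

# Wrapping the text in StringIO gives read_json a file-like object to read.
data = PD.read_json(io.StringIO(sample_text))

print(data.shape)  # (2, 3): two rows, three columns
print(list(data.columns))
```

With a real response, the same idea would be `PD.read_json(io.StringIO(r.text))`.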

#### Obtaining course details

Once we have the list of courses, we want to enrich the data with additional information - namely the offering and description. To do this, we need to make a Web API request for each of the courses that are in our __data__ worksheet. We need to go row by row, and use the __Id__ field to make a Web API request for each course using the __unit_url__.

Once we get the data for each course, we need to add it to our __data__ worksheet - simply adding a column with the appropriate data.

The __for__ loop allows us to perform tasks across a set of rows of data. Each iteration of the __for__ loop will process the next row of data in sequence. A good explanation of __for__ loops can be found [here](https://www.programiz.com/python-programming/for-loop). Let's examine how this works in this example.

The following is often considered a __for each__ style loop. It essentially means that for each of the rows in the DataFrame __data__, extract the index and the row itself and place them into the variables __index__ and __row__ respectively. Then execute the code block directly below it.

```python
for index, row in data.iterrows():
```

The __row__ is just a copy of the row in the data frame so changing the row doesn't affect the DataFrame data at all. That's why we also need to use the __index__ when we want to change data in the data frame. The index keeps track of which row we are currently working on. We can change a cell in that row by directly referencing it in the data frame by using:
```python
data.at[index, "column name"] = "New Value"
```

That is basically accessing a specific row and a specific column in the data frame (kind of like a cell in an Excel worksheet).
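
The difference between changing the __row__ and changing the data frame can be seen in a small experiment. The sketch below uses a tiny made-up data frame rather than the real course list, and the offering value `"S1"` is invented for the illustration.

```python
import pandas as PD

# A tiny stand-in for the course list, with mixed column types.
data = PD.DataFrame({"Code": ["ABEC313", "ABEP330"], "Id": [40321, 18]})

for index, row in data.iterrows():
    # Changing the row only changes the copy handed to us by iterrows.
    row["Code"] = "CHANGED"

    # Writing through data.at changes the DataFrame itself.
    if index == 0:
        data.at[index, "Offering"] = "S1"  # "S1" is a made-up value

print(data)
```

The __Code__ column keeps its original values, while the new __Offering__ column holds `"S1"` in the first row and a missing value in the second.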

```python
# loop over each of the rows in the data frame
for index, row in data.iterrows():
    # build a URL using the unit_url template and replacing the placeholder with the course Id
    url = unit_url.format(row["Id"])
    # fetch the data
    r = s.get(url)

    # 200 means that the fetch was successful and we only proceed if it is 200
    if r.status_code == 200:
        # output some information for the user running this code
        print("processing code: {}".format(row["Code"]))

        # load the response text into a Python object
        unit_data = json.loads(r.text)

        # check if we have UnitOfferings in the course details and if not, move on to the next row.
        if len(unit_data["UnitOfferings"]) == 0:
            continue

        # check if we have a 'code' field in the UnitOfferings and if so
        # add it to the data frame row we are working with
        if "code" in unit_data["UnitOfferings"][0]:
            data.at[index, "Offering"] = unit_data["UnitOfferings"][0]["code"]

        # same as above but with description
        if "description" in unit_data["UnitOfferings"][0]:
            data.at[index, "Description"] = unit_data["UnitOfferings"][0]["description"]
```
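
The checks inside the loop can also be gathered into a small helper, which makes them easy to try out without calling the API. This is only a sketch: `first_offering` is a hypothetical name, and the sample dictionaries are made up to match the fields the loop reads (`UnitOfferings`, `code` and `description`). Unlike the loop above, it also tolerates a response with no `UnitOfferings` key at all.

```python
def first_offering(unit_data):
    """Return (code, description) from the first unit offering, or (None, None)."""
    # .get avoids a KeyError when "UnitOfferings" is absent from the response.
    offerings = unit_data.get("UnitOfferings") or []
    if not offerings:
        return None, None
    first = offerings[0]
    # .get returns None for a field that is not present.
    return first.get("code"), first.get("description")


# Hypothetical responses, shaped like the fields the loop reads.
with_offering = {"UnitOfferings": [{"code": "S1 Day", "description": "An example unit."}]}
without_offering = {"UnitOfferings": []}
missing_key = {}

print(first_offering(with_offering))
print(first_offering(without_offering))
print(first_offering(missing_key))
```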

#### Cleaning up the data and checking what we have
At this point we should have completed building our dataset. We just need to clean it up a little because when we add new columns to the data frame the default value is a special _missing value_ called __NaN__. We replace all these __NaN__ values with empty strings to make things look a lot neater by calling `fillna("")`. Note that `fillna` returns a new data frame rather than changing the existing one, so the result has to be assigned back to __data__.

Then we display a sample.

```python
data = data.fillna("")
data.head()
```
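
To confirm that `fillna` leaves the original data frame alone and hands back a cleaned copy, here is a minimal sketch using a made-up two-row data frame. The name `cleaned` is used only in this example.

```python
import pandas as PD

# Made-up frame with one missing value, like a course that had no offering.
data = PD.DataFrame({"Code": ["ABEC313", "ABEP330"], "Offering": ["S1", None]})

# fillna returns a new DataFrame; the original is left untouched.
cleaned = data.fillna("")

print(data["Offering"].isna().sum())     # 1
print(cleaned["Offering"].isna().sum())  # 0
```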

#### Saving the CSV file
Again, Pandas makes this trivial. The line below creates a CSV file for us and populates it with our data frame. However, we can also select the columns that we want to output if there are extra ones that we don't want in our output. We pass in a list of column names that we want to output. The `index=False` part tells Pandas that we don't want the row numbers in the output. If we omit this the first column will be row numbers.

```python
data.to_csv("unit-codes-2019.csv",
            index=False,
            columns=['Code', 'Department', 'Faculty', 'Name', 'Offering', 'Description'])
```
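
A quick way to preview what the column selection and `index=False` do, without creating a file, is to leave out the file name so that `to_csv` returns the CSV text. The sketch below uses a small made-up data frame with an extra __Id__ column that we do not want in the output.

```python
import pandas as PD

# Made-up frame with an extra Id column that should not be exported.
data = PD.DataFrame({
    "Id": [40321, 18],
    "Code": ["ABEC313", "ABEP330"],
    "Name": ["Early Development 2", "Program Planning in ATSI Contexts"],
})

# With no file name, to_csv returns the CSV text instead of writing a file.
csv_text = data.to_csv(index=False, columns=["Code", "Name"])
print(csv_text)
```

The output starts with the header line `Code,Name`, followed by one line per course, with no row numbers and no __Id__ column.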