# Workshop 8 Overview

## Case Study Background
Today, we'll be working with `XML` and `JSON` file formats. An example of what the corresponding DataFrame looks like is given below:

|   | cinema_id | film_id | session_type | date       | start_time |
|---|-----------|---------|--------------|------------|------------|
| 0 | 8941      | 227902  | standard     | 2018-09-14 | 14:30      |
| 1 | 8941      | 227902  | VMAX         | 2018-09-15 | 15:45      |
| 2 | 8941      | 123456  | standard     | 2018-09-15 | 17:05      |
| 3 | 8941      | 123456  | VMAX         | 2018-09-16 | 19:25      |
| 4 | 8941      | 123456  | VMAX         | 2018-09-17 | 18:00      |

Please note that the first JSON example does **not** have the `session_type` feature. 

## Learning objectives
Become proficient in manipulating `XML` and `JSON` data structures:
- Learn about parsing and working with `XML` files.
- Understand how to traverse an `XML` data structure. 
- Learn about parsing and working with `JSON` files. 


## Workshop Overview
- Read in `XML` files
- Traverse and interpret the `XML` data structure
- Create and output `XML` files
- Parse `JSON` files
- Modify the contents of `JSON` files
- Create and output `JSON` files

In [1]:
# new libraries for today
import json
from lxml import etree

# pandas
import pandas as pd

# <u>Concept: XML</u>
## Why XML and when do we see it?
- Extensible Markup Language (`XML`) is a widely used **markup language** that defines rules for encoding documents or data structures. It is much closer to `HTML` than it is to Python.
- Where `HTML` describes how a web page is structured for display, `XML` is used in a similar way to describe and store *data*.
- `HTML` and `XML` both share the concept of tags (i.e. opening and closing tags written with `<` and `>`).

## XML and Python
- To parse XML data structures in Python, we will use the `lxml` library rather than the built-in `xml` library.
- This is because `lxml` offers a largely compatible API with more features, such as XPath queries and built-in pretty-printing.
- A notable module in `lxml` is `etree`, which parses XML data into a tree-like structure.
- Documentation: https://lxml.de/api/index.html

Recall how we used `open()` to read in text files:
```python
# Method 1
with open("some filename.txt", "r") as f:
    data = f.read()

# Method 2
f = open("some filename.txt", "r")
data = f.read()
...
f.close()
```

We do the same for `XML` files. Below, we read `sample_xml.xml` and print it out. Note that we are printing it below *as a string* (`f.read()`) and it is not an `XML` object yet.

In [2]:
# Read file and print out (haven't parsed into a Python object)
with open("sample_xml.xml", "r") as f:
    print(f.read())

<?xml version="1.0"?>
<cinema cinema_id="8941" cinema_name="Vue Cinemas - Reading">
  <showings film_id="227902" film_name="The Predator">
    <standard>
      <start_time>14:30</start_time>
      <date>2018-09-14</date>
    </standard>
    <VMAX>
      <start_time>15:45</start_time>
      <date>2018-09-15</date>
    </VMAX>
  </showings>
  <showings film_id="123456" film_name="Avengers">
    <standard>
      <start_time>17:05</start_time>
      <date>2018-09-15</date>
    </standard>
    <VMAX>
      <start_time>19:25</start_time>
      <date>2018-09-16</date>
    </VMAX>
    <VMAX>
      <start_time>18:00</start_time>
      <date>2018-09-17</date>
    </VMAX>
    <standard>
      <start_time>21:05</start_time>
      <date>2018-09-17</date>
    </standard>
    <VMAX>
      <start_time>10:05</start_time>
      <date>2018-09-18</date>
    </VMAX>
  </showings>
</cinema>


Here's a *visual* representation of the `XML` (made using https://codebeautify.org/xmlviewer):


<img align="left" src="xml.png" alt="">

It's good to note here that the `XML` has a hierarchical structure like trees in computing. The **root** node is `cinema`, with 2 child nodes called `showings`, and so on.

Now that we've seen what it looks like, let's work with the `lxml` library.

In [3]:
# Parse the file into an lxml ElementTree object
xmltree = etree.parse('sample_xml.xml')

# Get the root node
root = xmltree.getroot()

If we refer to the `xml` output above, our root node should be `cinema`.

`<cinema cinema_id="8941" cinema_name="Vue Cinemas - Reading">`

Let's go through the tags, attributes, text, and sub-elements.

In [4]:
# the name of the tag
print("Tag:", root.tag) 

# the given attributes in a dictionary-like format
print("Attributes:", root.attrib) 

# any text - here it is only whitespace (a newline and indentation), as there is no real text directly inside <cinema>
print("Content:", root.text) 

# the number of sub-elements or children below cinema
# if we look at the example above, we should see 2 "showings"
print("How many sub-elements/children:", len(root))

Tag: cinema
Attributes: {'cinema_id': '8941', 'cinema_name': 'Vue Cinemas - Reading'}
Content: 
  
How many sub-elements/children: 2
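
The whitespace printed for `Content:` above is worth a closer look. Here is a minimal sketch, assuming a small hypothetical snippet rather than `sample_xml.xml`, that shows how `.text` holds the indentation between an opening tag and its first child, and one way to test for "real" text. The `try`/`except` import is an assumption so that the sketch also runs where `lxml` is not installed.

```python
try:
    from lxml import etree  # the library used in this workshop
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

# Hypothetical snippet with the same shape as the top of sample_xml.xml
snippet = """<cinema cinema_id="8941">
  <showings film_id="227902"/>
</cinema>"""
demo_root = etree.fromstring(snippet)

# .text is the newline and indentation between <cinema> and <showings>
raw_text = demo_root.text
print(repr(raw_text))

# Treat whitespace-only text as "no content"
has_real_text = bool(raw_text and raw_text.strip())
print(has_real_text)
```

Checking `raw_text.strip()` rather than `raw_text` alone avoids mistaking indentation for data.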


Let's go through some examples:

In [5]:
# Get a specific attribute. This works like dict.get()
root.get('cinema_id')

'8941'

In [6]:
# Note that this method will return None by default if you're trying to access an attribute that's not there.
# Consider this behaviour when you're writing a loop to access the attributes.
# You can specify a different return value (e.g. False) like dict.get()
root.get('some_attribute_that_is_not_there', False)

False
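
To illustrate why the default matters inside a loop, here is a minimal sketch on a hypothetical snippet in which one `showings` element lacks `film_name`. The `rating` child is invented for illustration and is not part of `sample_xml.xml`; the import fallback is likewise an assumption for environments without `lxml`.

```python
try:
    from lxml import etree  # the library used in this workshop
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

# Hypothetical data: the second <showings> has no film_name and no <rating>
showings_xml = """<cinema>
  <showings film_id="227902" film_name="The Predator"><rating>15</rating></showings>
  <showings film_id="123456"/>
</cinema>"""
demo_root = etree.fromstring(showings_xml)

summaries = []
for showing in demo_root.iter("showings"):
    # Missing attribute: fall back to a placeholder instead of None
    name = showing.get("film_name", "unknown")
    # Missing child element: findtext() accepts a default as well
    rating = showing.findtext("rating", default="n/a")
    summaries.append((showing.get("film_id"), name, rating))

print(summaries)
```

Supplying defaults up front keeps the loop body free of `None` checks.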

Let's work with the sub-elements. From the `XML` above, we expect 2 child nodes with tag name `showings`.

We index them like we would a list (starting from 0). This means we should have index `0` and `1` to go through.

In [7]:
# How to index the first child. 
first_showings = root[0]
print(first_showings)
print("Tag:", first_showings.tag)
print("Attributes:", first_showings.attrib)

<Element showings at 0x7f18dc29af00>
Tag: showings
Attributes: {'film_id': '227902', 'film_name': 'The Predator'}


In [8]:
# How to index the second child. 
second_showings = root[1]
print(second_showings)
print("Tag:", second_showings.tag)
print("Attributes:", second_showings.attrib)

<Element showings at 0x7f18d5675240>
Tag: showings
Attributes: {'film_id': '123456', 'film_name': 'Avengers'}


In [9]:
# Like normal lists, you'll get an index error if the child does not exist
third_showings = root[2]

IndexError: list index out of range

Continuing on, let's work with the **first showing**. From the output above, this is what it looks like:
```xml
<showings film_id="227902" film_name="The Predator">
    <standard>
        <start_time>14:30</start_time>
        <date>2018-09-14</date>
    </standard>
    <VMAX>
        <start_time>15:45</start_time>
        <date>2018-09-15</date>
    </VMAX>
</showings>
```

We can see that the `showings` node contains:
1. a `standard` screening.
2. a `VMAX` screening.

Both screenings have a `start_time` and a `date`.

In [10]:
# Get first child of a specific tag
vmax = first_showings.find("VMAX")
print(vmax)
print("Tag:", vmax.tag)
print("Attributes:", vmax.attrib)

# which line this tag appears on (according to the first xml output above)
print("Line number on which this tag appears:", vmax.sourceline)

<Element VMAX at 0x7f18d55a8c80>
Tag: VMAX
Attributes: {}
Line number on which this tag appears: 8


We can also loop over all sub-elements/child nodes by using the `iterchildren` and `iterdescendants` methods:

In [11]:
# Get all children of a specific tag - in this case, VMAX
for vmax_session in first_showings.iterchildren(tag='VMAX'):
    print(vmax_session.find('date').text)

2018-09-15


Here, you can see the `.text` attribute at work, as we have text between the `date` opening and closing tags. Specifically:
```xml
<date>2018-09-15</date>
```

In [12]:
# Get all descendants
for vmax_session in root.iterdescendants(tag='VMAX'):
    print(vmax_session.find('date').text)

2018-09-15
2018-09-16
2018-09-17
2018-09-18


**Questions:**
- What's the difference between `.iterchildren()` and `.iterdescendants()`?
    - `.iterchildren()` goes through only the direct child nodes (one level down), whilst `.iterdescendants()` goes through every node at all lower levels (children, grandchildren, and so on).
- What is the *prolog* in the `XML`? Is this always required?
    - The prolog in XML shows us some metadata about the encoding and version of XML. It's optional, but good to have for best practice. 
- Why were "start_time" and "date" created as sub-elements instead of an attribute of VMAX / standard?
    - There are actually no rules about when to use attributes or when to create sub-elements. Essentially, you should be using attributes for "describing" the data and use sub-elements for the actual data. Since "start_time" and "date" are closer to _data_ than _metadata_ (data that describes data), we prefer sub-elements. It is also very fair to say that if you use attributes as a way of storing data, you end up with XML files that are difficult to read and maintain.
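
The first answer above can be checked on a tiny tree. This is a minimal sketch that deliberately uses only calls shared by `lxml` and the standard library (iterating an element, and `.iter()`), so it is not the notebook's own code; in `lxml` itself, `iterchildren()` and `iterdescendants()` give the same two traversals. The import fallback is an assumption for environments without `lxml`.

```python
try:
    from lxml import etree  # the library used in this workshop
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

demo_root = etree.fromstring(
    "<cinema><showings><VMAX><date>2018-09-15</date></VMAX></showings></cinema>"
)

# Direct children only: iterating an element yields the next level down
children = [child.tag for child in demo_root]

# All lower levels: .iter() walks the whole subtree in document order.
# It also yields the starting element itself, so we skip that one.
descendants = [node.tag for node in demo_root.iter() if node is not demo_root]

print(children)
print(descendants)
```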

Now, let's work on *adding* values to `XML`.

As an example, let's add a new cinema with `"cinema_id": "8932"`, `"cinema_name": "Another Cinemas"`, with no showings.

In [13]:
# First we need to create the new cinema as an Element object
new_cinema = etree.Element('cinema')
new_cinema.set("cinema_id", "8932")
new_cinema.set("cinema_name", "Another Cinemas")

# This is a function to "preview" the Element object as a string
print(etree.tostring(new_cinema, # the etree element to show
                     pretty_print=True, # if we want to nicely format the xml with indentation
                     encoding='unicode') # return a str (text) rather than bytes
)

<cinema cinema_id="8932" cinema_name="Another Cinemas"/>
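
The new cinema above has no showings. If it did, nested elements could be built in the same way. The following is a minimal sketch using `etree.SubElement`, with made-up film and session values; the import fallback is an assumption for environments without `lxml`.

```python
try:
    from lxml import etree  # the library used in this workshop
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

cinema = etree.Element("cinema")
cinema.set("cinema_id", "8932")

# SubElement creates a child and attaches it to the parent in one step
showing = etree.SubElement(cinema, "showings", film_id="227902")
session = etree.SubElement(showing, "standard")

# Text content is assigned through the .text attribute
etree.SubElement(session, "start_time").text = "14:30"
etree.SubElement(session, "date").text = "2018-09-14"

xml_text = etree.tostring(cinema, encoding="unicode")
print(xml_text)
```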



We then have to create a new **root**. This is because we can't have 2 `<cinema>` elements at the root level.

**Question: Why can't we have 2 elements at the root level?**
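
One way to explore this question is to try parsing both layouts and see which one the parser accepts. This is a minimal sketch; `parses_ok` is a hypothetical helper written for this illustration, and the import fallback is an assumption for environments without `lxml`.

```python
try:
    from lxml import etree  # the library used in this workshop
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree


def parses_ok(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        etree.fromstring(xml_text)
        return True
    except etree.ParseError:
        return False


one_root = "<cinemaList><cinema/><cinema/></cinemaList>"
two_roots = "<cinema/><cinema/>"

print(parses_ok(one_root))   # a single enclosing element
print(parses_ok(two_roots))  # two elements at the top level
```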

In [14]:
# create a new root node called cinemaList
new_root_node = etree.Element('cinemaList')

# We now add the 2 <cinema> elements to this <cinemaList> element
# Here, we can append them like we would to a list (lxml moves an element rather than copying it)
new_root_node.append(root) # the original root node
new_root_node.append(new_cinema) # the new element

In [15]:
# Preview our new tree
print(etree.tostring(new_root_node,
                     pretty_print=True,
                     encoding='unicode')
 )

<cinemaList>
  <cinema cinema_id="8941" cinema_name="Vue Cinemas - Reading">
  <showings film_id="227902" film_name="The Predator">
    <standard>
      <start_time>14:30</start_time>
      <date>2018-09-14</date>
    </standard>
    <VMAX>
      <start_time>15:45</start_time>
      <date>2018-09-15</date>
    </VMAX>
  </showings>
  <showings film_id="123456" film_name="Avengers">
    <standard>
      <start_time>17:05</start_time>
      <date>2018-09-15</date>
    </standard>
    <VMAX>
      <start_time>19:25</start_time>
      <date>2018-09-16</date>
    </VMAX>
    <VMAX>
      <start_time>18:00</start_time>
      <date>2018-09-17</date>
    </VMAX>
    <standard>
      <start_time>21:05</start_time>
      <date>2018-09-17</date>
    </standard>
    <VMAX>
      <start_time>10:05</start_time>
      <date>2018-09-18</date>
    </VMAX>
  </showings>
</cinema>
  <cinema cinema_id="8932" cinema_name="Another Cinemas"/>
</cinemaList>



Finally, to write to an `XML` file, use the `.write()` method. It's important to note that we use `wb` to write in *binary* mode. 

In [16]:
# Write to a new xml file
new_tree = etree.ElementTree(new_root_node)

with open('cinemaList.xml', 'wb') as f:
    new_tree.write(f, # file to write to
                   xml_declaration=True # to add the prolog
                  )
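
To see why binary mode is needed, here is a minimal sketch that writes a small hypothetical tree to an in-memory `io.BytesIO` buffer (standing in for a file opened with `wb`) and parses it back. The explicit `encoding="utf-8"` and the import fallback are assumptions made so the sketch behaves the same with or without `lxml`.

```python
import io

try:
    from lxml import etree  # the library used in this workshop
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

demo_tree = etree.ElementTree(
    etree.fromstring("<cinemaList><cinema cinema_id='8932'/></cinemaList>")
)

# BytesIO behaves like a file opened in binary ("wb") mode
buffer = io.BytesIO()
demo_tree.write(buffer, encoding="utf-8", xml_declaration=True)

# The serialised document is bytes and starts with the prolog
written = buffer.getvalue()
print(written)

# Rewind and parse it back to confirm the round trip
buffer.seek(0)
reloaded_root = etree.parse(buffer).getroot()
print(reloaded_root.tag, reloaded_root[0].get("cinema_id"))
```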

## Individual Exercise 1
Extract a DataFrame that consists of `cinema_id`, `film_id`, `session_type`, `date`, `start_time`.

There are 2 types of `session_type`: `VMAX` and `standard`.

Each `date` and `start_time` belongs to a session of one of these types.

Think about the attributes and methods stated above.

Here's the output to match so you can check your answer:
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>cinema_id</th>
      <th>film_id</th>
      <th>session_type</th>
      <th>date</th>
      <th>start_time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>8941</td>
      <td>227902</td>
      <td>standard</td>
      <td>2018-09-14</td>
      <td>14:30</td>
    </tr>
    <tr>
      <th>1</th>
      <td>8941</td>
      <td>227902</td>
      <td>VMAX</td>
      <td>2018-09-15</td>
      <td>15:45</td>
    </tr>
    <tr>
      <th>2</th>
      <td>8941</td>
      <td>123456</td>
      <td>standard</td>
      <td>2018-09-15</td>
      <td>17:05</td>
    </tr>
    <tr>
      <th>3</th>
      <td>8941</td>
      <td>123456</td>
      <td>VMAX</td>
      <td>2018-09-16</td>
      <td>19:25</td>
    </tr>
    <tr>
      <th>4</th>
      <td>8941</td>
      <td>123456</td>
      <td>VMAX</td>
      <td>2018-09-17</td>
      <td>18:00</td>
    </tr>
    <tr>
      <th>5</th>
      <td>8941</td>
      <td>123456</td>
      <td>standard</td>
      <td>2018-09-17</td>
      <td>21:05</td>
    </tr>
    <tr>
      <th>6</th>
      <td>8941</td>
      <td>123456</td>
      <td>VMAX</td>
      <td>2018-09-18</td>
      <td>10:05</td>
    </tr>
  </tbody>
</table>

In [17]:
### SOLUTION 
df_rows = []

cinema_id = root.get('cinema_id')

for film in root.iterchildren(tag='showings'):
    film_id = film.get('film_id')
    
    # Loop through each show time
    for show in film.iterchildren():
        session_type = show.tag
        start_time = show.find('start_time').text
        date = show.find('date').text
        
        data = {'cinema_id': cinema_id,
                'film_id': film_id,
                'session_type': session_type,
                'date': date,
                'start_time': start_time
        }
        
        df_rows.append(data)
            
df = pd.DataFrame(df_rows)
df.head()

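|   | cinema_id | film_id | session_type | date       | start_time |
|---|-----------|---------|--------------|------------|------------|
| 0 | 8941      | 227902  | standard     | 2018-09-14 | 14:30      |
| 1 | 8941      | 227902  | VMAX         | 2018-09-15 | 15:45      |
| 2 | 8941      | 123456  | standard     | 2018-09-15 | 17:05      |
| 3 | 8941      | 123456  | VMAX         | 2018-09-16 | 19:25      |
| 4 | 8941      | 123456  | VMAX         | 2018-09-17 | 18:00      |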


# <u>Concept: JSON</u>
- `JSON` (JavaScript Object Notation) is another common data format, widely used as a lighter-weight alternative to `XML` (for example, in web APIs).
- Once parsed, it works very much like a Python dictionary.
- To parse and read `json` files, we can use the `json` library.
- Documentation: https://docs.python.org/3/library/json.html
- Tutorial: https://www.w3schools.com/python/python_json.asp

As an FYI, if you are unsure about the differences between `Java` and `JavaScript`, think of it like this:
- *Java is to JavaScript as Car is to Carpet*

## Reading in JSON files
- `json.load()` reads `JSON` from a file object.
- `json.loads()` reads `JSON` from a string (the `s` stands for *string*, which is an easy-to-miss naming convention).
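
The two functions listed above can be compared side by side. This is a minimal sketch on a made-up JSON string, with `io.StringIO` standing in for a file opened in text mode.

```python
import io
import json

json_text = '{"cinema_id": 8941, "showings": []}'

# json.loads: parse directly from a string
from_string = json.loads(json_text)

# json.load: parse from a file-like object (StringIO stands in for open(...))
file_like = io.StringIO(json_text)
from_file = json.load(file_like)

print(from_string)
print(from_string == from_file)
```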

To read in a `JSON` file, we'll need to use `open()` and `json.load(file_object)`.

In [18]:
with open('sample_api.json', 'r') as f:
    sample_api = json.load(f)
    
sample_api

{'cinema': {'cinema_id': 8941,
  'cinema_name': 'Vue Cinemas - Reading',
  'showings': [{'film_id': 227902,
    'film_name': 'The Predator',
    'times': [{'start_time': '14:30', 'date': '2018-09-14'},
     {'start_time': '15:45', 'date': '2018-09-15'}]},
   {'film_id': 123456,
    'film_name': 'Avengers',
    'times': [{'start_time': '17:05', 'date': '2018-09-15'},
     {'start_time': '19:25', 'date': '2018-09-16'},
     {'start_time': '18:00', 'date': '2018-09-17'},
     {'start_time': '21:05', 'date': '2018-09-17'},
     {'start_time': '10:05', 'date': '2018-09-18'}]}]}}

Now, let's use `.loads()`.

**IMPORTANT**: `JSON` requires double quotes around keys and string values.
For example: `{'key': 'value'}` (incorrect) vs `{"key": "value"}` (correct)

In [19]:
str(sample_api)

"{'cinema': {'cinema_id': 8941, 'cinema_name': 'Vue Cinemas - Reading', 'showings': [{'film_id': 227902, 'film_name': 'The Predator', 'times': [{'start_time': '14:30', 'date': '2018-09-14'}, {'start_time': '15:45', 'date': '2018-09-15'}]}, {'film_id': 123456, 'film_name': 'Avengers', 'times': [{'start_time': '17:05', 'date': '2018-09-15'}, {'start_time': '19:25', 'date': '2018-09-16'}, {'start_time': '18:00', 'date': '2018-09-17'}, {'start_time': '21:05', 'date': '2018-09-17'}, {'start_time': '10:05', 'date': '2018-09-18'}]}]}}"

In [20]:
json.loads(str(sample_api))

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

In [21]:
# Replace all single quotes with double quotes.
# Note: this only works here because no value contains an apostrophe, True/False, or None.
fixed_string_json = str(sample_api).replace('\'', '"')
json.loads(fixed_string_json)

{'cinema': {'cinema_id': 8941,
  'cinema_name': 'Vue Cinemas - Reading',
  'showings': [{'film_id': 227902,
    'film_name': 'The Predator',
    'times': [{'start_time': '14:30', 'date': '2018-09-14'},
     {'start_time': '15:45', 'date': '2018-09-15'}]},
   {'film_id': 123456,
    'film_name': 'Avengers',
    'times': [{'start_time': '17:05', 'date': '2018-09-15'},
     {'start_time': '19:25', 'date': '2018-09-16'},
     {'start_time': '18:00', 'date': '2018-09-17'},
     {'start_time': '21:05', 'date': '2018-09-17'},
     {'start_time': '10:05', 'date': '2018-09-18'}]}]}}
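
The quote-replacement trick is fragile in general. Here is a minimal sketch, using a made-up record with an apostrophe and a boolean, that shows where it breaks and how `json.dumps()` produces valid `JSON` text instead.

```python
import json

# Hypothetical record: the apostrophe and the boolean both trip up str()
record = {"film_name": "Ocean's Eleven", "is_vmax": False}

# Naive approach: swap quote characters in Python's own repr
naive_text = str(record).replace("'", '"')
try:
    json.loads(naive_text)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False
print("Naive approach parsed:", naive_ok)

# json.dumps handles quoting, escaping, and true/false/null for us
json_text = json.dumps(record)
round_trip = json.loads(json_text)
print(json_text)
print(round_trip == record)
```

Serialising with `json.dumps()` and parsing with `json.loads()` are designed as a pair, so the round trip returns an equal dictionary.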

## Individual Exercise 2
Create a DataFrame with variable name `df` that consists of `cinema_id`, `film_id`, `date`, `start_time` from the given `JSON` file.

`JSON` works pretty much the same as Python dictionaries, so this would be the equivalent of making a Python dictionary into a pandas DataFrame.

In [22]:
df_rows = []

cinema_id = sample_api['cinema']['cinema_id']
for film in sample_api['cinema']['showings']:
    film_id = film['film_id']
    
    # Loop through each show time
    for show in film['times']:
        start_time = show['start_time']
        date = show['date']
        df_rows.append({
                'cinema_id': cinema_id,
                'film_id': film_id,
                'date': date,
                'start_time': start_time
            })
            
df = pd.DataFrame(df_rows)
df.head()

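|   | cinema_id | film_id | date       | start_time |
|---|-----------|---------|------------|------------|
| 0 | 8941      | 227902  | 2018-09-14 | 14:30      |
| 1 | 8941      | 227902  | 2018-09-15 | 15:45      |
| 2 | 8941      | 123456  | 2018-09-15 | 17:05      |
| 3 | 8941      | 123456  | 2018-09-16 | 19:25      |
| 4 | 8941      | 123456  | 2018-09-17 | 18:00      |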


Export this DataFrame as `JSON` and save it as a file named `export_df.json`.

In [23]:
json_str = df.to_json()
# Use a context manager so the file is flushed and closed before we read it back
with open("export_df.json", "w") as f:
    json.dump(json.loads(json_str), f)

pd.read_json("export_df.json")

|   | cinema_id | film_id | date       | start_time          |
|---|-----------|---------|------------|---------------------|
| 0 | 8941      | 227902  | 2018-09-14 | 2021-11-29 14:30:00 |
| 1 | 8941      | 227902  | 2018-09-15 | 2021-11-29 15:45:00 |
| 2 | 8941      | 123456  | 2018-09-15 | 2021-11-29 17:05:00 |
| 3 | 8941      | 123456  | 2018-09-16 | 2021-11-29 19:25:00 |
| 4 | 8941      | 123456  | 2018-09-17 | 2021-11-29 18:00:00 |
| 5 | 8941      | 123456  | 2018-09-17 | 2021-11-29 21:05:00 |
| 6 | 8941      | 123456  | 2018-09-18 | 2021-11-29 10:05:00 |


`json.dump()` and `json.dumps()` mirror `json.load()` and `json.loads()`:
- `json.dump()` writes `JSON` to a file object.
- `json.dumps()` returns `JSON` as a string (again, the `s` stands for *string*).

Since we want to write to a file, we have used `json.dump()`.
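
As a quick check of the distinction, here is a minimal sketch on a small made-up dictionary, with `io.StringIO` standing in for a file opened with `"w"`.

```python
import io
import json

# Hypothetical data shaped like a fragment of df.to_json() output
rows = {"cinema_id": {"0": 8941}, "start_time": {"0": "14:30"}}

# json.dumps: return the JSON text as a Python string
as_string = json.dumps(rows)

# json.dump: write the same JSON text to a file-like object
buffer = io.StringIO()
json.dump(rows, buffer)

print(as_string)
print(buffer.getvalue() == as_string)
```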

# Challenge
Extract a DataFrame that consists of `cinema_id`, `film_id`, `session_type`, `date`, `start_time`.

There are 2 types of `session_type`: `VMAX` and `standard`. Each `date` and `start_time` belongs to a session of one of these types.

In [24]:
with open('challenge_api.json', 'r') as f:
    challenge_api = json.load(f)
    
challenge_api

{'cinema': [{'cinema_id': 8941,
   'cinema_name': 'Vue Cinemas - Reading',
   'showings': [{'film_id': 227902,
     'film_name': 'The Predator',
     'times': {'standard': [{'start_time': '14:30', 'date': '2018-09-14'}],
      'VMAX': [{'start_time': '15:45', 'date': '2018-09-15'}]}},
    {'film_id': 123456,
     'film_name': 'Avengers',
     'times': {'VMAX': [{'start_time': '17:05', 'date': '2018-09-15'},
       {'start_time': '19:25', 'date': '2018-09-16'}]}}]},
  {'cinema_id': 8932,
   'cinema_name': 'Another Cinemas',
   'showings': [{'film_id': 227902,
     'film_name': 'The Predator',
     'times': {'VMAX': [{'start_time': '14:30', 'date': '2018-09-14'}],
      'standard': [{'start_time': '15:45', 'date': '2018-09-15'}]}},
    {'film_id': 123456,
     'film_name': 'Avengers',
     'times': {'standard': [{'start_time': '17:05', 'date': '2018-09-15'},
       {'start_time': '19:25', 'date': '2018-09-16'}]}}]}]}

In [25]:
### SOLUTION
df_rows = []

# for each cinema
for cinema in challenge_api['cinema']:
    cinema_id = cinema['cinema_id']
    
    # for each show
    for show in cinema['showings']:
        film_id = show['film_id']
        
        # for each session type
        for session_type in show['times']:
            # get the possible times
            session_times = show['times'][session_type]
            
            # for each possible starting time
            for time in session_times:
                # get the start time and date by key (this does not rely on key order)
                start_time, date = time['start_time'], time['date']
                df_rows.append({'cinema_id': cinema_id,
                                'film_id': film_id,
                                'session_type': session_type,
                                'date': date,
                                'start_time': start_time
                })
            
df = pd.DataFrame(df_rows)
df

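|   | cinema_id | film_id | session_type | date       | start_time |
|---|-----------|---------|--------------|------------|------------|
| 0 | 8941      | 227902  | standard     | 2018-09-14 | 14:30      |
| 1 | 8941      | 227902  | VMAX         | 2018-09-15 | 15:45      |
| 2 | 8941      | 123456  | VMAX         | 2018-09-15 | 17:05      |
| 3 | 8941      | 123456  | VMAX         | 2018-09-16 | 19:25      |
| 4 | 8932      | 227902  | VMAX         | 2018-09-14 | 14:30      |
| 5 | 8932      | 227902  | standard     | 2018-09-15 | 15:45      |
| 6 | 8932      | 123456  | standard     | 2018-09-15 | 17:05      |
| 7 | 8932      | 123456  | standard     | 2018-09-16 | 19:25      |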
