# XML and JSON

This chapter gives an overview of processing XML using various
Python modules. Each approach has advantages and
disadvantages, so no single approach will necessarily work in all
cases.
We also look at two serialization mechanisms, JSON and pickle.


## Structured data

Data comes in many different formats, and how familiar you are with each one depends on how much you work with it.

If you are going to extract some data from a system and send it to
someone (or something!) you probably need to think about how
you are going to format the data.

It usually needs to be readable by a computer program and it also
makes sense to make it human readable so that whoever is going
to process the data has an idea what each item of data represents.
Formatting data is all about adding extra context (metadata) to
the data itself, giving it extra meaning.


In [1]:
#Comma separated variables
data = '''
title,price,author
"Fly Fishing 101",25.99,"Fred Bloggs"
"Skiing on a Budget",299.99,"Itsa Joke"
'''

In [2]:
#JSON
data = [{"title":"Fly Fishing 101","price":25.99,"author":"Fred Bloggs"},
{"title":"Skiing on a Budget","price":299.99,"author":"Itsa Joke"}]


In [3]:
#YAML
data = ''' 
- {author: Fred Bloggs, price: 25.99, title: Fly Fishing 101}
- {author: Itsa Joke, price: 299.99, title: Skiing on a Budget}
'''

In [4]:
# XML
data = '''
<books>
<book title="Fly Fishing 101" price="25.99" author="Fred"/>
<book title="Skiing on a Budget" price="299.99" author="Dave"/>
</books>
'''

Of these four formats, XML and JSON are worth reviewing in detail. That is not to discount CSV, which is heavily used in contexts such as data analysis.
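
As a small illustration of moving between these formats, the sketch below reads the CSV sample with the standard csv module and emits it as JSON. The variable names are made up for this example, and the price conversion is an assumption about how the column should be typed.

```python
import csv
import io
import json

# Same sample as the CSV cell above, written with plain ASCII quotes.
csv_text = '''title,price,author
"Fly Fishing 101",25.99,"Fred Bloggs"
"Skiing on a Budget",299.99,"Itsa Joke"
'''

# DictReader uses the header row as the keys for each record.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information, so convert price explicitly.
for row in rows:
    row["price"] = float(row["price"])

print(json.dumps(rows, indent=2))
```

The header row is the only metadata CSV provides, which is why the numeric type has to be restored by hand.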


## XML
From the folder, open books.xml and books.dtd. Note that the two files are not linked to each other.

### Elements are related 
The document root is the outermost data element, excluding the
XML declaration, DTD header and other XML-specific components.

The APIs often refer to sibling and child elements. Siblings are at
the same level as the currently processed element; child elements
are beneath the currently processed element. The exact
relationship changes as we traverse the tree.
A leaf node is an element that has no child elements.
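
To make these terms concrete, here is a minimal sketch using xml.etree.ElementTree (covered later in this chapter). The two-book document is hypothetical and exists only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-book document, made up for this sketch.
xml_text = """<?xml version="1.0"?>
<catalog>
  <book id="bk1"><title>First</title><price>1.00</price></book>
  <book id="bk2"><title>Second</title><price>2.00</price></book>
</catalog>"""

root = ET.fromstring(xml_text)   # document root: the outermost element
children = list(root)            # child elements of the root
first, second = children         # the two <book> elements are siblings

# Leaf nodes are the elements that have no child elements.
leaves = [el.tag for el in root.iter() if len(el) == 0]

print(root.tag)
print([child.tag for child in children])
print(leaves)
```

Note that the XML declaration is not part of the tree: the root is the catalog element.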

Several Python modules make use of these relationships and can therefore work with
documents in this format.

Python has built-in modules for this, but there are others. A more complete implementation is included
in the (optional) lxml module, obtainable from the Python Package
Index. It also covers XML validation against DTD and Schema.
It needs to be installed into your current Python
environment for your specific platform. We do not cover this module in
detail in this chapter, but check the documentation at
www.lxml.de.

Be aware, though, that XML processing has known vulnerabilities, so take care with data from untrusted sources.

As mentioned, there are multiple modules for handling XML. expat is one; for brevity we will look at SAX, which may itself use expat.

## SAX
The Simple API for XML (SAX) works in a similar manner to expat, but the
callbacks have different names and there may be more features.
Depending on the implementation, SAX may call expat to parse
the XML file.

Both APIs are useful for rapid parsing and reading of large
XML documents because they do not load the entire structure into
memory; they simply treat it as a stream of events. As each event is
encountered, the appropriate handler is called.

Unlike DOM, SAX does not load an entire XML document into
memory. This makes it suitable for scanning and reading data
from a large file, but does not lend itself to easy modification and
re-writing of the data back to a file.
Store any required information as normal Python objects, or build
your own object class, as appropriate.

In essence, we define a content handler, which can be as simple or as
complex as we need.

Let's see it in action.

In [5]:
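import xml.sax
import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):
    """Print each element's attributes and its accumulated text."""
    def __init__(self):
        super().__init__()
        self.text = ''   # text collected for the current element
        self.tag = ''    # name of the most recently opened element
    def startElement(self, name, attributes):
        self.tag = name
        if attributes.items():
            print(attributes.items())
    def characters(self, data):
        # Skip whitespace-only chunks (the indentation between tags)
        if not data.isspace():
            self.text += data
    def endElement(self, name):
        if self.text:
            print(self.tag, ':', self.text)
            self.text = self.tag = ''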

In [6]:
parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse("books.xml")

[('id', 'bk101')]
author : Gambardella, Matthew
title : XML Developer's Guide
genre : Computer
price : 44.95
publish_date : 2000-10-01
description : An in-depth look at creating applications      with XML.
[('id', 'bk102')]
author : Ralls, Kim
title : Midnight Rain
genre : Fantasy
price : 5.95
publish_date : 2000-12-16
description : A former architect battles corporate zombies,      an evil sorceress, and her own childhood to become queen      of the world.
[('id', 'bk103')]
author : Corets, Eva
title : Maeve Ascendant
genre : Fantasy
price : 5.95
publish_date : 2000-11-17
description : After the collapse of a nanotechnology      society in England, the young survivors lay the      foundation for a new society.
[('id', 'bk104')]
author : Corets, Eva
title : Oberon's Legacy
genre : Fantasy
price : 5.95
publish_date : 2001-03-10
description : In post-apocalypse England, the mysterious      agent known only as Oberon helps to create a new life      for the inhabitants of London. Sequel to
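
The handler above prints as it goes. As a sketch of the earlier advice to store the information as normal Python objects, the variant below collects each book into a dict. The class name BookCollector and the inline sample are assumptions made for this example; the sample only imitates the shape of books.xml.

```python
import xml.sax

# Hypothetical sample, shaped like the books.xml output shown above.
SAMPLE = b"""<catalog>
  <book id="bk101"><author>Gambardella, Matthew</author><price>44.95</price></book>
  <book id="bk102"><author>Ralls, Kim</author><price>5.95</price></book>
</catalog>"""

class BookCollector(xml.sax.ContentHandler):
    """Collect each <book> into a plain dict as the events arrive."""

    def __init__(self):
        super().__init__()
        self.books = []      # finished records
        self.current = None  # record being built, or None outside <book>
        self.chunks = []     # text pieces for the element being read

    def startElement(self, name, attrs):
        if name == "book":
            self.current = {"id": attrs.get("id")}
        self.chunks = []

    def characters(self, content):
        # characters() may be called several times for one run of text.
        self.chunks.append(content)

    def endElement(self, name):
        if name == "book":
            self.books.append(self.current)
            self.current = None
        elif self.current is not None:
            self.current[name] = "".join(self.chunks).strip()
        self.chunks = []

collector = BookCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.books)
```

Only one record is held in progress at a time, so the memory cost depends on what you choose to keep rather than on the size of the document.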

## Parsing with DOM
The DOM approach loads the entire XML document into memory
as a tree of linked node objects.
The relationship between the nodes is provided by attributes that
may be accessed or iterated over.
Each node type provides methods appropriate to handle the
node's data and relationships.


Minidom is a minimal implementation of the DOM interface. Parsing an XML string or a file (or Python file object) using DOM creates a document root node object with child nodes beneath it, representing the XML layout with tags from the input document. As usual, use dir(node_object) to see the list of attributes and methods it supplies.
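
For the string case, a minimal sketch might look like the following. The inline document is hypothetical and deliberately has no whitespace between tags, so no whitespace text nodes appear in the tree.

```python
import xml.dom.minidom

# Hypothetical sample document for this sketch.
doc = xml.dom.minidom.parseString(
    "<catalog><book id='bk1'><title>First</title></book>"
    "<book id='bk2'><title>Second</title></book></catalog>"
)

root = doc.documentElement   # the <catalog> element
print(root.tagName, len(root.childNodes))

titles = {}
for book in doc.getElementsByTagName("book"):
    title = book.getElementsByTagName("title")[0]
    # The text lives in a child text node of <title>, not on <title> itself.
    titles[book.getAttribute("id")] = title.firstChild.data

print(titles)
```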

In [7]:
import xml.dom.minidom
doc = xml.dom.minidom.parse('books.xml')
print(doc.childNodes)
print(doc.firstChild.tagName)

[<DOM Element: catalog at 0x1f76b23bc10>]
catalog


In [8]:
for node in doc.childNodes:
    if node.nodeType == doc.ELEMENT_NODE:
        print(node.nodeName, "\n", node.childNodes)

catalog 
 [<DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b245d30>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b2551f0>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b2555e0>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b2559d0>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b255dc0>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b2591f0>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b259670>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b259a60>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b259ee0>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b25f310>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b25f790>, <DOM Text node "'\n   '">, <DOM Element: book at 0x1f76b25fb80>, <DOM Text node "'\n'">]


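Another way of doing this is to navigate with firstChild and nextSibling. The first child of catalog is a whitespace text node, so its next sibling is the first book; the slice [1:-1:2] then picks out the element nodes that sit between the whitespace text nodes.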

In [9]:
book1 = doc.firstChild.firstChild.nextSibling
print(book1.childNodes[1:-1:2])

[<DOM Element: author at 0x1f76b251e50>, <DOM Element: title at 0x1f76b251ee0>, <DOM Element: genre at 0x1f76b251f70>, <DOM Element: price at 0x1f76b255040>, <DOM Element: publish_date at 0x1f76b2550d0>, <DOM Element: description at 0x1f76b255160>]


We can also iterate through every book, printing its id attribute and the text of each child element.

In [10]:
for book in doc.getElementsByTagName('book'):
    print(book.getAttributeNode('id').nodeValue)
    for child in book.childNodes:
        if child.nodeType == book.ELEMENT_NODE:
            for detail in child.childNodes:
                print(detail.data)

bk101
Gambardella, Matthew
XML Developer's Guide
Computer
44.95
2000-10-01
An in-depth look at creating applications
      with XML.
bk102
Ralls, Kim
Midnight Rain
Fantasy
5.95
2000-12-16
A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
bk103
Corets, Eva
Maeve Ascendant
Fantasy
5.95
2000-11-17
After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.
bk104
Corets, Eva
Oberon's Legacy
Fantasy
5.95
2001-03-10
In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.
bk105
Corets, Eva
The Sundered Grail
Fantasy
5.95
2001-09-10
The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.
bk106
Randall, Cynthia
Lover Birds
Romance
4.95
2000-09-02
When Carla meets Paul

## Using etree
The etree approach is very Pythonic. It uses standard Python
interfaces such as iterators, lists and dictionaries to implement an
iterable tree of objects. The ElementTree object encompasses the
entire XML document and may be iterated through using the
.iter*() and .find*() methods. Iteration is depth-first, in document
order: each element is visited before its children, and its children
before its following siblings.
The document root is an Element object and may also be iterated
through using similar methods to the ElementTree. In addition,
Element objects have list-like properties and methods as well as
data attributes such as tag, tail, text and attrib, the latter being
implemented as a dictionary of node attributes.
New sub-element objects may be created and attached to the tree
using the ET.SubElement() factory function.
The Python documentation for xml.etree.ElementTree contains a
short tutorial, see docs.python.org/3/library/xml.etree.elementtree.html.
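
The sketch below exercises the pieces just described (iter(), find(), attrib and list-like indexing) on a small inline document. The document and its values are hypothetical, chosen only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for this sketch.
root = ET.fromstring(
    "<catalog>"
    "<book id='bk1'><title>First</title><price>1.50</price></book>"
    "<book id='bk2'><title>Second</title><price>2.25</price></book>"
    "</catalog>"
)

# iter() walks the whole subtree depth-first, optionally filtered by tag.
prices = [float(p.text) for p in root.iter("price")]

# find() returns the first matching child; attrib is a plain dict.
first = root.find("book")
print(first.attrib["id"], first.findtext("title"))

# Elements are list-like: len() and indexing work on the children.
print(len(root), root[1][0].text)
print(sum(prices))
```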

In [13]:
import xml.etree.ElementTree as ET
tree = ET.parse('books.xml')
root = tree.getroot()
for book in root.findall('book'):
    for item in book:
        print(item.tag, ":", item.text)

author : Gambardella, Matthew
title : XML Developer's Guide
genre : Computer
price : 44.95
publish_date : 2000-10-01
description : An in-depth look at creating applications
      with XML.
author : Ralls, Kim
title : Midnight Rain
genre : Fantasy
price : 5.95
publish_date : 2000-12-16
description : A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
author : Corets, Eva
title : Maeve Ascendant
genre : Fantasy
price : 5.95
publish_date : 2000-11-17
description : After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.
author : Corets, Eva
title : Oberon's Legacy
genre : Fantasy
price : 5.95
publish_date : 2001-03-10
description : In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.
author : Corets, Eva
title : The Sundered

As well as iterating through elements, because they display list-like
properties, we can also index into them.
The document root Element equates to the <catalog> node.
Therefore root[0] is an Element denoting the first <book> node and
root[1] is an Element denoting the second <book> node. The children of
each book are author, title, genre, price, publish_date and description,
so book2[3] is the price SubElement of book2.
In this example we create a SubElement representing a new
child of the book1 node, and we update the text of the price
SubElement of the book2 node.
Note that the empty dictionary '{}' represents the (in this case,
non-existent) attributes of the new node.


In [14]:
book1 = root[0]
pub = ET.SubElement(book1, 'Published', {})
pub.text = 'Today'
for item in book1.iter():
    print(item.text)
book2 = root[1]
book2[3].text = '£12.00'  # index 3 is price; index 2 is genre
for item in book2.iter():
    print(item.text)



      
Gambardella, Matthew
XML Developer's Guide
Computer
44.95
2000-10-01
An in-depth look at creating applications
      with XML.
Today

      
Ralls, Kim
Midnight Rain
Fantasy
£12.00
2000-12-16
A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
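
Once a tree has been modified it can be turned back into XML text. The sketch below uses ET.tostring() on a hypothetical one-book document; the note element and its lang attribute are made up for this example. For files, the ElementTree object's write() method plays the same role.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-book document for this sketch.
root = ET.fromstring(
    "<catalog><book id='bk1'><price>5.95</price></book></catalog>"
)

book = root[0]
book.find("price").text = "12.00"                    # update existing text
note = ET.SubElement(book, "note", {"lang": "en"})   # new child with one attribute
note.text = "Reduced"

# tostring() serialises an element and its subtree;
# encoding="unicode" returns a str rather than bytes.
xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Looking the price up by tag with find() avoids depending on the position of the child, which is easy to get wrong when indexing.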


## JSON
Transferring data can be a cumbersome task. XML is all well and
good, however it needs a dedicated parser to read and write, and
it does not map directly onto native objects.
JavaScript Object Notation (or JSON) is a lightweight data
interchange format which is easy for people to read and write and,
more importantly, easy for machines to parse and generate.
JSON is a text format that is programming language independent
and uses conventions familiar to most programmers.

JSON is built on two universal data structures, supported by most programming languages:

- A collection of name/value pairs, realised as an object (associative array, hash or dictionary)
- An ordered list of values, realised as an array or list

A JSON object:

- Is an unordered set of name/value pairs
- Begins with { (left brace) and ends with } (right brace)
- Has each name followed by a : (colon)
- Has name/value pairs separated by a , (comma)
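
In Python these two structures map onto dict and list. The sketch below parses a small JSON text and checks the resulting types; the field names tags, sequel and in_print are invented for this example.

```python
import json

# Hypothetical record; only title and price come from the earlier samples.
text = (
    '{"title": "Fly Fishing 101", "price": 25.99, '
    '"tags": ["sport", "outdoors"], "sequel": null, "in_print": true}'
)

obj = json.loads(text)

# A JSON object becomes a dict and a JSON array becomes a list.
print(type(obj).__name__, type(obj["tags"]).__name__)

# null and true become None and True.
print(obj["sequel"], obj["in_print"])

# Serialising again gives text that parses back to an equal object.
round_trip = json.dumps(obj, sort_keys=True)
print(round_trip)
```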




In [15]:
import json

with open('books.json', 'r') as f:
    data = json.load(f)
    print(data['books'])

[{'isbn': '9781593275846', 'title': 'Eloquent JavaScript, Second Edition', 'subtitle': 'A Modern Introduction to Programming', 'author': 'Marijn Haverbeke', 'published': '2014-12-14T00:00:00.000Z', 'publisher': 'No Starch Press', 'pages': 472, 'description': 'JavaScript lies at the heart of almost every modern web application, from social apps to the newest browser-based games. Though simple for beginners to pick up and play with, JavaScript is a flexible, complex language that you can use to build full-scale applications.', 'website': 'http://eloquentjavascript.net/'}, {'isbn': '9781449331818', 'title': 'Learning JavaScript Design Patterns', 'subtitle': "A JavaScript and jQuery Developer's Guide", 'author': 'Addy Osmani', 'published': '2012-07-01T00:00:00.000Z', 'publisher': "O'Reilly Media", 'pages': 254, 'description': "With Learning JavaScript Design Patterns, you'll learn how to write beautiful, structured, and maintainable JavaScript by applying classical and modern design patter

In [16]:
import json

person_dict = {'name': 'Bob',
'age': 12,
'children': None
}
person_json = json.dumps(person_dict)

print(person_json)

{"name": "Bob", "age": 12, "children": null}


In [17]:

import json

person_dict = {"name": "Bob",
"languages": ["English", "French"],
"married": True,
"age": 32
}

with open('person.txt', 'w') as json_file:
  json.dump(person_dict, json_file)
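
To confirm that a dumped file reads back unchanged, a round trip can be sketched as follows. The temporary directory and the file name person.json are assumptions made so that the example leaves nothing behind.

```python
import json
import os
import tempfile

person = {
    "name": "Bob",
    "languages": ["English", "French"],
    "married": True,
    "age": 32,
}

# A temporary directory keeps the sketch from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "person.json")   # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(person, f, indent=2)
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

print(restored == person)
```

The comparison holds here because the dictionary uses only strings, numbers, booleans and lists, all of which JSON represents directly.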

### De-serialise to a class
It's often useful to de-serialise data into a class instance so that it can be worked with in different ways.

In [1]:
import json

class Payload:
    def __init__(self, action, method, data):
        self.action = action
        self.method = method
        self.data = data

# Example JSON string
json_string = '{"action": "print", "method": "onData", "data": "Madan Mohan"}'

# Deserialize to a Payload object
p = Payload(**json.loads(json_string))

# Access the attributes
print(p.action)  
print(p.method)  
print(p.data)    


print
onData
Madan Mohan
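
A variation on the same idea, offered as a sketch rather than the only way to do it, uses a dataclass together with the object_hook parameter of json.loads(). The helper name as_payload is made up for this example.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Payload:
    action: str
    method: str
    data: str

def as_payload(d):
    """object_hook: turn matching dicts into Payload, leave others untouched."""
    if d.keys() == {"action", "method", "data"}:
        return Payload(**d)
    return d

json_string = '{"action": "print", "method": "onData", "data": "Madan Mohan"}'

# json.loads() calls the hook for every JSON object it decodes.
p = json.loads(json_string, object_hook=as_payload)
print(p)

# asdict() converts the instance back into a dict that json can serialise.
print(json.dumps(asdict(p)))
```

The dataclass supplies the constructor and a readable repr, and asdict() gives the reverse path back to JSON.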


## Pickle
Python provides the pickle module for serialization and de-serialization of Python objects such as lists, dictionaries and tuples. Other languages call this marshalling or flattening. Pickling is used to store Python objects.

Serialization or pickling:
Pickling is the process of converting a Python object (list, dict, tuple, etc.) into a byte stream that can be saved to disk or transferred over a network.

De-serialization or unpickling:
The byte stream saved to a file contains the information needed to reconstruct the original Python object. The process of converting a byte stream back into a Python object is called de-serialization.

Below are the steps for pickling in Python:

1. Import the pickle module.
2. Use pickle.dump(obj, file) to save the object to a file object opened in binary write mode ('wb'); the object is stored in byte format.
3. Use pickle.load(file), with a file object opened in binary read mode ('rb'), to load the Python object back from the file it was dumped to.
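
The same round trip can be done in memory with pickle.dumps() and pickle.loads(), which is a convenient way to see the byte stream itself. The sample dictionary below is made up for this sketch.

```python
import pickle

# Integer keys, tuples and sets are kept as they are by pickle.
original = {1: ("monday", "tuesday"), "tags": {"a", "b"}}

blob = pickle.dumps(original)    # serialise to a bytes object
print(type(blob).__name__)

restored = pickle.loads(blob)    # rebuild an equal object from the bytes
print(restored == original)
print(type(restored[1]).__name__, type(restored["tags"]).__name__)
```

Unlike JSON, the tuple and the set come back as a tuple and a set. Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.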


In [18]:
import pickle
# creating python object --> dictionary
dictionary = {1: 'monday', 2: 'tuesday', 3: 'wednesday', 4: 'thursday', 5: 'friday', 6: 'saturday', 7: 'sunday'}
print('Pickling')
# open a file where to store dictionary
print("dictionary to be stored:")
print(dictionary)
with open('dictionary.pkl', 'wb') as file:
    pickle.dump(dictionary, file) # storing dictionary into file

print('\n')
print('Un-pickling')
with open('dictionary.pkl', 'rb') as file:
    unpickled_dict = pickle.load(file)

print("displaying dictionary data")
for key, item in unpickled_dict.items():
    print(key, '-->', item)

Pickling
dictionary to be stored:
{1: 'monday', 2: 'tuesday', 3: 'wednesday', 4: 'thursday', 5: 'friday', 6: 'saturday', 7: 'sunday'}


Un-pickling
displaying dictionary data
1 --> monday
2 --> tuesday
3 --> wednesday
4 --> thursday
5 --> friday
6 --> saturday
7 --> sunday
