# Introduction to XML Data Format

-----

## Introduction

We have already touched upon data formats in the context of data
persistence. But one of the most important tasks when starting a data
analysis project is understanding the format of a data file and how to
best extract the necessary information from the data, whatever the
format. In this notebook, we explore the XML data format, and present
how to read and write data in this format by using standard, built-in
Python tools.

-----



## Table of Contents


-----


Before we begin, however, we need to read in test data that we can
then write to, and read back from, the XML format.

This notebook will only work after the [Text Data Format][tdf] notebook
has been successfully completed.

-----

[tdf]: text-dataformat.ipynb

In [1]:
# Note that we implicitly assume the text-dataformat notebook has already been executed,
# which creates the temp directory and downloads the airports.csv data
import csv

airports = []

with open('/home/data_scientist/temp/airports.csv', 'r') as csvfile:
    
    for row in csv.reader(csvfile, delimiter=','):
        airports.append(row)

print(airports[0:3])

[['iata', 'airport', 'city', 'state', 'country', 'lat', 'long'], ['00M', 'Thigpen ', 'Bay Springs', 'MS', 'USA', '31.95376472', '-89.23450472'], ['00R', 'Livingston Municipal', 'Livingston', 'TX', 'USA', '30.68586111', '-95.01792778']]


-----
[[Back to TOC]](#Table-of-Contents)


## XML

[Extensible Markup Language][xml], or XML, is a simple, self-describing,
text-based data format. XML is a standard developed by the World Wide
Web Consortium ([W3C][w3c]), originally for large-scale publishing, but
with the growth of the web it has taken on new roles. XML is based on
the concept of an element, which can have attributes and values. Elements
can be nested, which can indicate parent-child relationships or a form of
containerization. While you may never deal directly with XML files,
you will likely interact with other formats that are based on, or closely
related to, XML, such as the latest version of the HyperText Markup
Language ([HTML5][html5], which also defines an XML syntax) or the
Scalable Vector Graphics format ([SVG][svg]).

Given its structured format, you don't simply read an XML document; you
must parse the document to build up a model of the elements and their
relationships. The [`ElementTree`][xmlpy] parsing model is implemented
within the standard Python distribution in the `xml` library. To write
an XML file, we simply need to create an `ElementTree` instance, for
example by parsing a string with `fromstring` and passing the resulting
root element to the class constructor, and then write this XML-encoded
data to a file. One caveat with this entire process, however, is that
the following five characters: `<`, `>`, `&`, `'`, and `"` are used by
the markup language itself, so they must be replaced by their
corresponding _entity code_ whenever they appear in the data. For these
five characters, that can be easily done by using the `html.escape`
function, as shown in the following code cell.
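
To make the escaping step concrete, here is a minimal sketch of what `html.escape` does to a string containing all five special characters, and how the original text is recovered when the escaped value is parsed. The sample string `raw` is a made-up value used only for illustration; it is not taken from the airport data.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical value containing all five special characters
raw = 'Tom & Jerry\'s <"Field">'

# Replace each special character with its entity code
escaped = html.escape(raw)
print(escaped)
# → Tom &amp; Jerry&#x27;s &lt;&quot;Field&quot;&gt;

# The escaped value can now be embedded safely in an attribute;
# the parser converts the entity codes back to the original characters
element = ET.fromstring('<airport name="{0}"/>'.format(escaped))
print(element.attrib['name'] == raw)
# → True
```

Without the call to `html.escape`, the `&` and `<` characters in this string would make the document malformed, and `ET.fromstring` would raise a `ParseError`.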

-----
[xml]: http://www.w3.org/XML/
[w3c]: http://www.w3.org
[html5]: http://www.w3.org/TR/html5/
[svg]: http://www.w3.org/Graphics/SVG/
[xmlpy]: https://docs.python.org/3/library/markup.html

In [2]:
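import html
import xml.etree.ElementTree as ET

# Build the XML document as a string, starting with the XML declaration
data = '<?xml version="1.0"?>\n' + '<airports>\n'

# Skip the header row; escape every value so that any markup
# characters in the data do not produce a malformed document
for airport in airports[1:]:
    data += '    <airport name="{0}">\n'.format(html.escape(airport[1]))
    data += '        <iata>' + html.escape(str(airport[0])) + '</iata>\n'
    data += '        <city>' + html.escape(str(airport[2])) + '</city>\n'
    data += '        <state>' + html.escape(str(airport[3])) + '</state>\n'
    data += '        <country>' + html.escape(str(airport[4])) + '</country>\n'
    data += '        <latitude>' + html.escape(str(airport[5])) + '</latitude>\n'
    data += '        <longitude>' + html.escape(str(airport[6])) + '</longitude>\n'

    data += '    </airport>\n'

data += '</airports>\n'

# Parse the string into a root element and wrap it in an ElementTree
tree = ET.ElementTree(ET.fromstring(data))

# The file is opened in text mode, so we request 'unicode' (string) output
with open('/home/data_scientist/temp/data.xml', 'w') as fout:
    tree.write(fout, encoding='unicode')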


-----
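
The cell above assembles the document by string concatenation, which is why the escaping has to be done by hand. As an alternative sketch (not the approach used in this notebook), the same structure can be built with `ET.Element` and `ET.SubElement`, in which case the library escapes special characters when the tree is serialized. The first row below is copied from the airport data shown earlier; the second row and the `tags` list are hypothetical and exist only to demonstrate the escaping.

```python
import xml.etree.ElementTree as ET

# Rows in the same column order as airports.csv:
# iata, airport, city, state, country, lat, long
rows = [
    ['00M', 'Thigpen ', 'Bay Springs', 'MS', 'USA', '31.95376472', '-89.23450472'],
    # Hypothetical row whose name contains markup characters
    ['XXX', 'Smith & Sons <Private>', 'Nowhere', 'ZZ', 'USA', '0.0', '0.0'],
]

# Child element names, in the order they should appear under <airport>
tags = ['iata', 'city', 'state', 'country', 'latitude', 'longitude']

root = ET.Element('airports')
for row in rows:
    # The airport name becomes an attribute; no manual escaping is needed
    airport = ET.SubElement(root, 'airport', name=row[1])
    # Every other column becomes a child element holding text
    for tag, value in zip(tags, [row[0]] + row[2:]):
        ET.SubElement(airport, tag).text = value

# Serialize to a string; special characters are escaped at this point
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

The serialized text contains `Smith &amp; Sons &lt;Private&gt;`, and parsing it again returns the original name unchanged.

-----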

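Since the XML format is text based, we can easily view the contents of
our new XML file by using the `head` command, as done before. Note that
the XML declaration from our original string does not appear in the
file, since `ElementTree` does not write one by default when the
encoding is `unicode`. In this case, the XML format is our own creation,
but if we were following a standard, additional information would be
present to indicate the full document provenance.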

-----

In [3]:
!head -9 /home/data_scientist/temp/data.xml

<airports>
    <airport name="Thigpen ">
        <iata>00M</iata>
        <city>Bay Springs</city>
        <state>MS</state>
        <country>USA</country>
        <latitude>31.95376472</latitude>
        <longitude>-89.23450472</longitude>
    </airport>


-----

As the XML document contents above demonstrate, the XML format can be
quite verbose. However, the document's contents are clearly visible and
easily understood. This enables an XML document to be [parsed][ps]
based on a rough knowledge of the document. First we need to create an
`ElementTree` object and parse the contents of the document, which we
can do by calling the `parse` function and passing in the name of our
XML document file.
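
As a minimal sketch of this first step, the following example writes a tiny document to a temporary file and parses it. The file name `sample.xml` and the use of a temporary directory are assumptions made only to keep the example self-contained; they are not part of this notebook's data.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file used only for this illustration
    path = os.path.join(tmpdir, 'sample.xml')
    with open(path, 'w') as fout:
        fout.write('<airports><airport name="Thigpen "/></airports>')

    # parse reads the file and returns an ElementTree object
    tree = ET.parse(path)

    # The root element is obtained from the tree
    root = tree.getroot()

print(type(tree).__name__, root.tag)
# → ElementTree airports
```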

When parsing an XML document, we obtain a tree model of the XML elements
contained in the document. The base of this model is the _root_ element,
which we retrieve by calling the `getroot` method on the `ElementTree`
object returned by `parse`. While there are a number of methods that can
be used to find or iterate through elements in the document, in our case
we simply want to process each `airport` element; thus we use the
`findall` method to find all `airport` elements that are direct children
of the root. The child elements of each `airport` element can be
accessed by position, like a Python `list`. The text within an element
is accessed through that element's `text` attribute, while the XML
attributes of an element are stored in its `attrib` attribute, a Python
`dict` in which the name of the XML attribute acts as the _key_ to
request a particular _value_. These techniques are demonstrated in the
next code cell, where we read in our new XML document and extract the
airport information.
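
Before applying these techniques to the full file, here is a small sketch of each access pattern on an in-memory document. The two airports are taken from the data shown earlier, trimmed to two child elements for brevity; `ET.fromstring` is used here in place of `ET.parse` only so that the example does not depend on a file.

```python
import xml.etree.ElementTree as ET

document = """<airports>
    <airport name="Thigpen ">
        <iata>00M</iata>
        <city>Bay Springs</city>
    </airport>
    <airport name="Livingston Municipal">
        <iata>00R</iata>
        <city>Livingston</city>
    </airport>
</airports>"""

# fromstring returns the root element directly
root = ET.fromstring(document)

# findall returns a list of matching direct children
airports_found = root.findall('airport')
first = airports_found[0]

print(root.tag, len(airports_found))
# → airports 2

# XML attributes are read from the attrib dictionary
print(first.attrib['name'])

# Child elements can be reached by position, like list items ...
print(first[0].text)
# → 00M

# ... or by tag name, which does not depend on element order
print(first.find('city').text)
# → Bay Springs
```

Looking up children by tag name with `find` is somewhat more robust than positional indexing when the order of child elements is not guaranteed.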

-----

[ps]: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml

In [4]:
data = [["iata", "airport", "city", "state", "country", "lat", "long"]]

tree = ET.parse('/home/data_scientist/temp/data.xml')
root = tree.getroot()

for airport in root.findall('airport'):
    row = []
    row.append(airport[0].text)
    row.append(airport.attrib['name'])
    row.append(airport[1].text)
    row.append(airport[2].text)
    row.append(airport[3].text)
    row.append(airport[4].text)
    row.append(airport[5].text)

    data.append(row)
    
print(data[:5])

[['iata', 'airport', 'city', 'state', 'country', 'lat', 'long'], ['00M', 'Thigpen ', 'Bay Springs', 'MS', 'USA', '31.95376472', '-89.23450472'], ['00R', 'Livingston Municipal', 'Livingston', 'TX', 'USA', '30.68586111', '-95.01792778'], ['00V', 'Meadow Lake', 'Colorado Springs', 'CO', 'USA', '38.94574889', '-104.5698933'], ['01G', 'Perry-Warsaw', 'Perry', 'NY', 'USA', '42.74134667', '-78.05208056']]


-----

The preceding data formats (fixed-width, delimiter-separated value,
JSON, and XML) are the primary text-based data formats that data
scientists need to be able to use. While easy to read and relatively
easy to parse, they are not always the best solution, especially for
large, numerical data. Specialized binary formats exist, many of which
are domain-specific, but there is one widely used format that
continues to gain ground in data science applications.

-----


## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [XML Tutorial][1] by W3Schools.
2. [HTML Tutorial][2], a document markup language closely related to XML, by W3Schools.
3. [SVG Tutorial][3], an XML-based image format, by W3Schools.

-----

[1]: http://www.w3schools.com/xml/default.asp
[2]: http://www.w3schools.com/html/default.asp
[3]: http://www.w3schools.com/svg/default.asp



**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode