# Parsing XML Udacity
## Reference: [GeeksforGeeks: XML Parsing in Python](http://www.geeksforgeeks.org/xml-parsing-python/)
JSON maps neatly onto dictionaries and arrays (lists in Python). Parsing XML is more complex.

## XML Tree Structure
XML documents form a tree structure that starts at "the root" and branches to "the leaves".  
<img src="XML Tree.PNG">

XML documents are formed as element trees.

An XML tree starts at a root element and branches from the root to child elements. The terms parent, child, and sibling are used to describe the relationships between elements.

Parents have children. Children have parents. Siblings are children on the same level (brothers and sisters).

All elements can have text content (Harry Potter) and attributes (category="cooking").

All elements can have sub elements (child elements):
<img src="xml2.PNG">
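
The parent, child, and sibling relationships, along with text content and attributes, can be inspected directly in Python. The following is a minimal sketch that assumes a small, made-up bookstore document (the XML string and the variable names are illustrative, not taken from the course files):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the course files.
BOOKSTORE_XML = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

bookstore = ET.fromstring(BOOKSTORE_XML)  # root element (parent of both books)

books = list(bookstore)                   # children of the root; siblings of each other
first_book = books[0]

print(bookstore.tag)                      # tag name of the root
print(first_book.attrib)                  # attributes as a dictionary
print(first_book.find('title').text)      # text content of a child element
```

Iterating over an element yields its direct children only, so `list(bookstore)` returns the two `book` elements and not their `title` or `price` children.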



## XML uses a largely self-describing syntax


<img src = "xml3.PNG">

The XML prolog is optional. If it exists, it must come first in the document.

XML documents can contain international characters, like Norwegian øæå or French êèé.

To avoid errors, you should specify the encoding used, or save your XML files as UTF-8.

<img src = "xml4.PNG">
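
To see why the declared encoding matters, here is a small sketch. It assumes a made-up one-element document whose prolog declares ISO-8859-1 and whose bytes are encoded to match; the parser picks up the encoding from the prolog when it is given bytes:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the prolog declares ISO-8859-1,
# and the bytes are encoded to match that declaration.
xml_text = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            '<note>Norwegian: øæå. French: êèé</note>')
xml_bytes = xml_text.encode('iso-8859-1')

note = ET.fromstring(xml_bytes)   # the parser reads the encoding from the prolog
print(note.text)

# Re-serialize as UTF-8 so the file can be saved without encoding surprises.
utf8_bytes = ET.tostring(note, encoding='utf-8')
print(utf8_bytes.decode('utf-8'))
```

If the declared encoding and the actual bytes disagree, the parser either fails or silently produces wrong characters, which is the error the advice above is meant to prevent.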

## Parsing XML into a document tree

Here we read the entire XML tree into memory.

In [1]:
import xml.etree.ElementTree as ET
import pprint

In [2]:
# ET can parse data in a couple of different ways; here we parse from a file
tree = ET.parse('exampleResearchArticle.xml')
root = tree.getroot()  # get the root element from the parsed tree

In [7]:
root.tag

'art'

In [8]:
root.attrib

{}

In [9]:
type(root)

xml.etree.ElementTree.Element
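
As a rough illustration of the "couple of different ways" to parse, the sketch below contrasts `ET.parse` (file name or file-like object, returns an `ElementTree`) with `ET.fromstring` (string, returns the root `Element` directly). The sample string is a made-up stand-in for the article file; only the root tag `art` and the child tag names come from this notebook:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the article file.
SAMPLE = "<art><ui>id-1</ui><fm/><bdy/></art>"

# Way 1: parse from a file name or file-like object; returns an ElementTree.
tree = ET.parse(io.StringIO(SAMPLE))
root_from_file = tree.getroot()

# Way 2: parse from a string; returns the root Element directly.
root_from_string = ET.fromstring(SAMPLE)

print(type(tree).__name__, type(root_from_string).__name__)
print(root_from_file.tag, root_from_string.tag)
```

Both routes load the whole document into memory, so either one suits the document-tree approach used here.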

In [3]:
# Iterate over the children of the root element
print("\nChildren of root")
for child in root:
    print(child.tag)  # the tag attribute holds the tag name of each child element
    


Children of root
ui
ji
fm
bdy
bm


In [12]:
for child in root.findall('./fm/bibl/*'):   # '*' selects every child element of fm/bibl
    print(child.tag)

title
aug
insg
source
issn
pubdate
volume
issue
fpage
url
xrefbib


### Trying to extract the title of the article.
It is found in the bibliography section of the front matter (fm).

ElementTree supports basic XPath expressions, which is useful because in data wrangling we spend most of our time pulling data out of the XML document.

Here we use the XPath expression __('./fm/bibl/title')__ to say where we expect to find the title element. __'.'__ means start at the current element and work your way down through fm/bibl/title.

In this file, all text is wrapped in paragraph (p) elements. The cell below therefore extends the path to the first p child of the title and reads its text using __.text__. The commented-out lines show the alternative: take the title element, iterate over its children, and concatenate the text of each one.

In [16]:
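p = root.find('./fm/bibl/title/p')  # first <p> element inside the title
p.text
# Alternative: concatenate the text of every <p> child of the title element
#title = root.find('./fm/bibl/title')
#title_text = ""
#for p in title:
#    title_text += p.text
#print("\nTitle : \n",title_text)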

'Standardization of the functional syndesmosis widening by dynamic U.S examination'
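
The difference between `find`, `findall`, and collecting text from several children can be sketched on a miniature document. The XML below is a made-up imitation of the article layout (art > fm > bibl > title > p), with a two-paragraph title chosen to show why concatenation may be needed:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the article layout.
SAMPLE = """
<art>
  <fm>
    <bibl>
      <title><p>A sample</p><p> two-part title</p></title>
    </bibl>
  </fm>
</art>
"""
root = ET.fromstring(SAMPLE)

first_p = root.find('./fm/bibl/title/p')     # first match only (or None)
all_p = root.findall('./fm/bibl/title/p')    # every match, as a list

title = root.find('./fm/bibl/title')
title_text = ''.join(title.itertext())       # text of title and all its descendants

print(first_p.text)
print(len(all_p))
print(title_text)
```

`itertext()` is one alternative to looping over the children by hand; it also picks up any text that sits directly inside the title element.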

In [5]:
print("\nAuthor email addresses : ")
for a in root.findall('./fm/bibl/aug/au'): # findall returns all elements that match the XPath expression
    email = a.find('email')
    if email is not None:
        print(email.text)


Author email addresses : 
omer@extremegate.com
mcarmont@hotmail.com
laver17@gmail.com
nyska@internet-zahav.net
kammarh@gmail.com
gideon.mann.md@gmail.com
barns.nz@gmail.com
eukots@gmail.com
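
The `is not None` check is needed because `find` returns `None` when nothing matches. A possible shortcut is `findtext`, which returns the text of the first match or a default value. The sketch below assumes a made-up author group in which the second author has no `email` element:

```python
import xml.etree.ElementTree as ET

# Hypothetical author group; the second author has no <email> element.
SAMPLE = """
<aug>
  <au><snm>Doe</snm><fnm>Jane</fnm><email>jane@example.com</email></au>
  <au><snm>Roe</snm><fnm>Rick</fnm></au>
</aug>
"""
aug = ET.fromstring(SAMPLE)

emails = []
for au in aug.findall('./au'):
    # findtext returns the matched element's text, or the default when nothing matches
    emails.append(au.findtext('email', default=None))

print(emails)
```

Calling `au.find('email').text` on the second author would raise an `AttributeError`, because `find` returns `None` there.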


### Extracting data on the authors of an article. The data for each author is stored in a Python dictionary, and all the dictionaries are stored in a list. (The insr tags are collected as a list of their iid attribute values.)

In [6]:
authors = []
for author in root.findall('./fm/bibl/aug/au'):
    data = {
        "fnm": None,
        "snm": None,
        "email": None,
        "insr": []
    }

    data['snm'] = author.find('snm').text  # each author has a single name and email, so we use "find"
    data['fnm'] = author.find('fnm').text  # the text attribute holds the text content of the element
    data['email'] = author.find('email').text
    insr = author.findall('./insr')  # an author can have several "insr" elements, so we use "findall" and iterate over the returned list
    for i in insr:
        data["insr"].append(i.attrib["iid"])  # attrib is a dictionary of the element's attributes; read the "iid" attribute
    authors.append(data)



pprint.pprint(authors)

[{'email': 'omer@extremegate.com',
  'fnm': 'Omer',
  'insr': ['I1'],
  'snm': 'Mei-Dan'},
 {'email': 'mcarmont@hotmail.com',
  'fnm': 'Mike',
  'insr': ['I2'],
  'snm': 'Carmont'},
 {'email': 'laver17@gmail.com',
  'fnm': 'Lior',
  'insr': ['I3', 'I4'],
  'snm': 'Laver'},
 {'email': 'nyska@internet-zahav.net',
  'fnm': 'Meir',
  'insr': ['I3'],
  'snm': 'Nyska'},
 {'email': 'kammarh@gmail.com',
  'fnm': 'Hagay',
  'insr': ['I8'],
  'snm': 'Kammar'},
 {'email': 'gideon.mann.md@gmail.com',
  'fnm': 'Gideon',
  'insr': ['I3', 'I5'],
  'snm': 'Mann'},
 {'email': 'barns.nz@gmail.com',
  'fnm': 'Barnaby',
  'insr': ['I6'],
  'snm': 'Clarck'},
 {'email': 'eukots@gmail.com', 'fnm': 'Eugene', 'insr': ['I7'], 'snm': 'Kots'}]
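
The same extraction can be wrapped in a function that tolerates missing elements. This is a sketch under assumptions, not the course solution: the function name `extract_authors` and the sample document are hypothetical, and `findtext` is swapped in for `find(...).text` so that a missing `fnm`, `snm`, or `email` yields `None` instead of raising an error:

```python
import xml.etree.ElementTree as ET

def extract_authors(root):
    """Return one dictionary per <au> element under ./fm/bibl/aug (hypothetical helper)."""
    authors = []
    for au in root.findall('./fm/bibl/aug/au'):
        authors.append({
            'fnm': au.findtext('fnm'),      # None if the element is missing
            'snm': au.findtext('snm'),
            'email': au.findtext('email'),
            # one entry per <insr> element, taken from its iid attribute
            'insr': [i.attrib['iid'] for i in au.findall('./insr')],
        })
    return authors

# Hypothetical sample that follows the article layout used above.
SAMPLE = """<art><fm><bibl><aug>
<au><snm>Doe</snm><fnm>Jane</fnm><insr iid="I1"/><insr iid="I2"/><email>jane@example.com</email></au>
<au><snm>Roe</snm><fnm>Rick</fnm><insr iid="I1"/></au>
</aug></bibl></fm></art>"""

sample_authors = extract_authors(ET.fromstring(SAMPLE))
print(sample_authors)
```

With the real file, the call would be `extract_authors(root)` on the root element obtained earlier.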
