# Notebook on how to read/parse data from XML (local file or over the internet)

XML stands for **eXtensible Markup Language**.

It is a markup language for web applications, much like its better-known cousin HTML. It was designed to store and transport data and to be completely self-descriptive. The core idea of XML is prominently displayed in its name – **extensibility**. While HTML has a set of rigid standards and fixed tags, XML lets a designer define her own tags, their order of appearance in a data structure, and even how they should be displayed or processed.

Fundamentally, it is a textual data format with built-in support (via Unicode) for the widest variety of natural languages. Although XML was designed primarily for documents, it is widely used to represent the data structures exchanged by modern web services, APIs, and RESTful microservices. In that regard, XML, along with JSON, is one of the most frequently encountered data exchange formats on the web.

For a fuller introduction, see **["A Really, Really, Really Good Introduction to XML"](https://www.sitepoint.com/really-good-introduction-xml/)**.

### We will use Python's built-in `xml.etree.ElementTree` module to parse data

In [3]:
import xml.etree.ElementTree as ET

### Create some random data yourself to understand the XML data format better

In [4]:
data = '''
<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
     +1 742 101 4456
   </phone>
   <email hide="yes">
   dave.p@gmail.com</email>
</person>'''

In [5]:
print(data)


<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
     +1 742 101 4456
   </phone>
   <email hide="yes">
   dave.p@gmail.com</email>
</person>


In [6]:
type(data)

str

### Read the string data as an XML `Element` object 

In [7]:
tree = ET.fromstring(data)

In [8]:
type(tree)

xml.etree.ElementTree.Element
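
Before searching for individual tags, it can help to see what the parsed element contains. The short sketch below repeats the sample record under a new name (`person_xml`, chosen here only to avoid clobbering `data`) and loops over the root element, which yields its direct children in document order.

```python
import xml.etree.ElementTree as ET

# Same record as above, repeated so this cell runs on its own
person_xml = '''
<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
     +1 742 101 4456
   </phone>
   <email hide="yes">
   dave.p@gmail.com</email>
</person>'''

person = ET.fromstring(person_xml)

# Iterating over an Element yields its direct children in document order.
# Each child exposes its tag name, its attributes (a dict) and its text.
summary = [(child.tag, child.attrib, child.text.strip()) for child in person]

for tag, attrib, text in summary:
    print(tag, attrib, text)
```

Every child in this record has text, so calling `.strip()` directly is safe here; an empty element would have `text` set to `None`.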

### Find various elements of data within the tree (element)

In [9]:
# Print the name of the person
print('Name:', tree.find('name').text)

Name: Dave


In [10]:
# Print the surname
print('Surname:', tree.find('surname').text)

Surname: Piccardo


In [11]:
# Print the phone number
print('Phone:', tree.find('phone').text.strip())

Phone: +1 742 101 4456


In [12]:
# Print email status and the actual email
print('Email hidden:', tree.find('email').get('hide'))
print('Email:', tree.find('email').text.strip())

Email hidden: yes
Email: dave.p@gmail.com
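
The lookups above assume every tag is present. `find` returns `None` when a tag is missing, so `tree.find('email').text` would raise an `AttributeError` on an incomplete record. The sketch below uses a made-up record without an `<email>` element (the `fax` tag is also hypothetical) to show two ways of guarding against that.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: same shape as the sample above, but with no <email>
incomplete_xml = '''
<person>
  <name>Dave</name>
  <phone type="intl">
     +1 742 101 4456
  </phone>
</person>'''

person = ET.fromstring(incomplete_xml)

# Option 1: find() returns None for a missing tag, so test before using .text
email = person.find('email')
email_text = email.text.strip() if email is not None else 'n/a'

# Option 2: findtext() returns the text directly and accepts a default value
phone_text = person.findtext('phone', default='n/a').strip()
fax_text = person.findtext('fax', default='n/a')

print('Email:', email_text)
print('Phone:', phone_text)
print('Fax:', fax_text)
```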


### Read from a local XML file (perhaps downloaded) into an `ElementTree` object

In [17]:
tree2=ET.parse('xml1.xml')

In [18]:
type(tree2)

xml.etree.ElementTree.ElementTree
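
Note the difference from `ET.fromstring`: `ET.parse` returns an `ElementTree` wrapper around the whole document, while `ET.fromstring` returns the root `Element` directly. The sketch below illustrates this with an in-memory file object, because the contents of `xml1.xml` are not shown in this notebook; the `<data>` root tag and the overall layout are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for xml1.xml (the real file's layout is an assumption)
xml_text = '''<data>
  <country1 name="Norway">
    <gdppc>70617</gdppc>
    <neighbor name="Sweden"/>
  </country1>
</data>'''

# ET.parse accepts a filename or a file object and returns an ElementTree
tree_obj = ET.parse(io.StringIO(xml_text))
root_elem = tree_obj.getroot()      # the root Element of that tree

# ET.fromstring skips the wrapper and returns the root Element directly
same_root = ET.fromstring(xml_text)

print(type(tree_obj).__name__, type(root_elem).__name__, type(same_root).__name__)
```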

### How to 'traverse' the tree? Find the `root` and explore all `child` nodes and their `attributes`

In [19]:
root=tree2.getroot()

In [20]:
for child in root:
    print("Child tag:", child.tag, "| Child attribute:", child.attrib)

Child tag: country1 | Child attribute: {'name': 'Norway'}
Child tag: country2 | Child attribute: {'name': 'Austria'}
Child tag: country3 | Child attribute: {'name': 'Israel'}


### Use the `.text` attribute to extract meaningful data

In [28]:
root[0][2]

<Element 'gdppc' at 0x00000000051FF278>

In [26]:
root[0][2].text

'70617'

In [22]:
root[0][2].tag

'gdppc'

In [23]:
root[0]

<Element 'country1' at 0x00000000050298B8>

In [24]:
root[0].tag

'country1'

In [25]:
root[0].attrib

{'name': 'Norway'}

### Write a loop to extract and print the GDP per capita for each country

In [26]:
for c in root:
    country_name=c.attrib['name']
    gdppc = int(c[2].text)
    print("{}: {}".format(country_name,gdppc))

Norway: 70617
Austria: 44857
Israel: 38788
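
Indexing with `c[2]` works only while `gdppc` stays the third child of every country. A lookup by tag name is less fragile. The sketch below is one way to do that; it uses a hypothetical stand-in for `xml1.xml` in which only the country names and GDP values come from the output above, and the rest of the layout is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for xml1.xml; only the names and GDP values are taken
# from the output above, the rest of the layout is assumed.
countries_xml = '''<data>
  <country1 name="Norway"><gdppc>70617</gdppc></country1>
  <country2 name="Austria"><gdppc>44857</gdppc></country2>
  <country3 name="Israel"><gdppc>38788</gdppc></country3>
</data>'''

countries = ET.fromstring(countries_xml)

gdp_by_country = {}
for c in countries:
    # Look the element up by tag, so its position among siblings does not matter
    gdp_text = c.findtext('gdppc')
    if gdp_text is not None:
        gdp_by_country[c.get('name')] = int(gdp_text)

print(gdp_by_country)
```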


### Find all the neighboring countries for each country and print them
Note how to use `findall` and `attrib` together

In [35]:
for c in root:
    ne=c.findall('neighbor') # Find all the neighbors
    print("Neighbors\n"+"-"*25)
    for i in ne: # Iterate over the neighbors and print their 'name' attribute
        print(i.attrib['name'])
    print('\n')

Neighbors
-------------------------
Sweden


Neighbors
-------------------------
Germany
Hungary
Italy
Switzerland


Neighbors
-------------------------
Lebanon
Syria




### A simple demo of using XML data retrieved from a web API

In [27]:
import urllib.request, urllib.parse, urllib.error

In [28]:
serviceurl = 'http://www.recipepuppy.com/api/?'

In [31]:
item = input('Enter the name of a food item (enter \'quit\' to quit): ')  # input() already returns a str
url = serviceurl + urllib.parse.urlencode({'q':item})+'&p=1&format=xml'
uh = urllib.request.urlopen(url)

Enter the name of a food item (enter 'quit' to quit): chicken tikka
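
The cell above encodes only the `q` parameter and appends the rest by hand. A slightly more defensive sketch follows: it passes every parameter through `urlencode` and wraps the download in error handling. The helper names `build_url` and `fetch_xml` are made up for this example, and nothing here assumes the service is reachable, so `fetch_xml` is defined but not called.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

SERVICE_URL = 'http://www.recipepuppy.com/api/?'


def build_url(item, page=1):
    """Return the request URL with every parameter URL-encoded."""
    params = {'q': item, 'p': page, 'format': 'xml'}
    return SERVICE_URL + urllib.parse.urlencode(params)


def fetch_xml(url, timeout=10):
    """Download and parse an XML document; return None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return ET.fromstring(response.read())
    except (OSError, ET.ParseError) as err:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print('Request failed:', err)
        return None


print(build_url('chicken tikka'))
```

For the search term used above, `build_url` produces the same URL as the manual concatenation: `http://www.recipepuppy.com/api/?q=chicken+tikka&p=1&format=xml`.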


In [32]:
data = uh.read().decode()
print('Retrieved', len(data), 'characters')
tree3 = ET.fromstring(data)

Retrieved 2611 characters


In [33]:
type(tree3)

xml.etree.ElementTree.Element

In [34]:
for elem in tree3.iter():
    print(elem.text)





Chicken Tikka Masala
http://allrecipes.com/Recipe/Chicken-Tikka-Masala/Detail.aspx
black pepper, chicken, butter, cayenne, cinnamon, cumin, cumin, garlic, heavy cream, jalapeno, lemon juice, paprika, salt, salt, yogurt


Chicken Tikka With Chickpea Flour
http://www.recipezaar.com/Chicken-Tikka-With-Chickpea-Flour-224938
chicken, chickpea flour, chili powder, cumin, garlic, ginger, lemon juice, nutmeg, salt, turmeric


Chicken Tikka Masala
http://www.recipezaar.com/Chicken-Tikka-Masala-289402
black pepper, chicken, tomato, cayenne, chicken broth, garam masala, garlic, ginger, cardamom, cinnamon, coriander, cumin, onions, paprika, yogurt, salt, tomato paste, turmeric, vegetable oil


Chicken Tikka Masala Recipe
http://www.grouprecipes.com/37802/chicken-tikka-masala.html
cumin, garam masala


Chicken Tikka Masala
http://www.recipezaar.com/Chicken-Tikka-Masala-166811
chicken, butter, cayenne, cilantro, ginger, black pepper, garam masala, garlic, cinnamon, cumin, cumin, heavy cream, jal
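
The blank lines in this output come from container elements such as `<recipes>` and `<recipe>`, whose `text` is only the whitespace between their child tags. A small sketch of one way to skip them and label each value with its tag is shown below. It uses an abridged copy of the first recipe (the ingredient list is shortened) so that it runs without a network call.

```python
import xml.etree.ElementTree as ET

# Abridged sample in the same shape as the response (ingredient list shortened)
recipes_xml = '''<?xml version="1.0"?>
<recipes>
<recipe>
<title>Chicken Tikka Masala</title>
<href>http://allrecipes.com/Recipe/Chicken-Tikka-Masala/Detail.aspx</href>
<ingredients>black pepper, chicken, butter</ingredients>
</recipe>
</recipes>'''

doc = ET.fromstring(recipes_xml)

lines = []
for elem in doc.iter():
    # elem.text may be None or pure whitespace for container elements
    text = (elem.text or '').strip()
    if text:
        lines.append('{}: {}'.format(elem.tag, text))

print('\n'.join(lines))
```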

### What does the raw data look like?

In [37]:
print(data)

<?xml version="1.0"?>
<recipes>
<recipe>
<title>Chicken Tikka Masala</title>
<href>http://allrecipes.com/Recipe/Chicken-Tikka-Masala/Detail.aspx</href>
<ingredients>black pepper, chicken, butter, cayenne, cinnamon, cumin, cumin, garlic, heavy cream, jalapeno, lemon juice, paprika, salt, salt, yogurt</ingredients>
</recipe>
<recipe>
<title>Chicken Tikka With Chickpea Flour</title>
<href>http://www.recipezaar.com/Chicken-Tikka-With-Chickpea-Flour-224938</href>
<ingredients>chicken, chickpea flour, chili powder, cumin, garlic, ginger, lemon juice, nutmeg, salt, turmeric</ingredients>
</recipe>
<recipe>
<title>Chicken Tikka Masala</title>
<href>http://www.recipezaar.com/Chicken-Tikka-Masala-289402</href>
<ingredients>black pepper, chicken, tomato, cayenne, chicken broth, garam masala, garlic, ginger, cardamom, cinnamon, coriander, cumin, onions, paprika, yogurt, salt, tomato paste, turmeric, vegetable oil</ingredients>
</recipe>
<recipe>
<title>Chicken Tikka Masala Recipe</title>
<href>htt

### How to build a simple list of all hyperlinks (recipe pages) present in the data?

In [70]:
for e in tree3.iter():
    h = e.find('href')
    t = e.find('title')
    # Only <recipe> elements have both an <href> and a <title> child
    if h is not None and t is not None:
        print("Recipe Link for:", t.text)
        print(h.text)
        print("-"*100)

Recipe Link for: Chicken Tikka Masala
http://allrecipes.com/Recipe/Chicken-Tikka-Masala/Detail.aspx
----------------------------------------------------------------------------------------------------
Recipe Link for: Chicken Tikka With Chickpea Flour
http://www.recipezaar.com/Chicken-Tikka-With-Chickpea-Flour-224938
----------------------------------------------------------------------------------------------------
Recipe Link for: Chicken Tikka Masala
http://www.recipezaar.com/Chicken-Tikka-Masala-289402
----------------------------------------------------------------------------------------------------
Recipe Link for: Chicken Tikka Masala Recipe
http://www.grouprecipes.com/37802/chicken-tikka-masala.html
----------------------------------------------------------------------------------------------------
Recipe Link for: Chicken Tikka Masala
http://www.recipezaar.com/Chicken-Tikka-Masala-166811
------------------------------------------------------------------------------------
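
Since every recipe sits in its own `<recipe>` element directly under the root, an alternative to scanning all elements is to ask for those elements by name and collect the results in a list. The sketch below does that with a two-recipe excerpt of the raw data shown earlier (ingredients omitted); the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

# Two-recipe excerpt of the raw data shown earlier (ingredients omitted)
recipes_xml = '''<?xml version="1.0"?>
<recipes>
<recipe>
<title>Chicken Tikka Masala</title>
<href>http://allrecipes.com/Recipe/Chicken-Tikka-Masala/Detail.aspx</href>
</recipe>
<recipe>
<title>Chicken Tikka With Chickpea Flour</title>
<href>http://www.recipezaar.com/Chicken-Tikka-With-Chickpea-Flour-224938</href>
</recipe>
</recipes>'''

doc = ET.fromstring(recipes_xml)

# findall('recipe') returns only the <recipe> children of the root element
links = [
    (r.findtext('title', default=''), r.findtext('href', default=''))
    for r in doc.findall('recipe')
]

for title, href in links:
    print(title, '->', href)
```

Keeping the pairs in a list, rather than printing them inside the loop, makes it easy to sort, filter, or save them later.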